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ABSTRACT 

Klinvex,  Alicia  Marie  Ph.D.,  Purdue  University,  May  2015.  Parallel  Symmetric  Eigen¬ 
value  Problem  Solvers.  Major  Professors:  Ahmed  Sameh. 

Sparse  symmetric  eigenvalue  problems  arise  in  many  computational  science  and 
engineering  applications:  in  structural  mechanics,  nanoelectronics,  and  spectral  re¬ 
ordering,  for  example.  Often,  the  large  size  of  these  problems  requires  the  develop¬ 
ment  of  eigensolvers  that  scale  well  on  parallel  computing  platforms.  In  this  thesis,  I 
will  describe  two  such  eigensolvers,  TraceMin  and  TraceMin-Davidson.  These  meth¬ 
ods  are  different  from  many  other  eigensolvers  in  that  they  do  not  require  accurate 
linear  solves  to  be  performed  at  each  iteration  in  order  to  hnd  the  smallest  eigenvalues 
and  their  associated  eigenvectors.  After  introducing  these  closely  related  eigensolvers, 
I  will  discuss  alternative  methods  for  solving  the  saddle  point  problems  arising  at  each 
iteration,  which  can  improve  the  overall  running  time.  Additionally,  I  will  present 
a  new  TraceMin  implementation  geared  towards  Ending  large  numbers  of  eigenpairs 
in  any  given  interval  of  the  spectrum,  TraceMin-Multisectioning.  I  will  conclude 
with  numerical  experiments  comparing  my  trace-minimization  solvers  to  other  pop¬ 
ular  eigensolvers  (such  as  Krylov-Schur,  LOBPCG,  Jacobi-Davidson,  and  FEAST), 
establishing  the  competitiveness  of  my  methods. 


1 


1  INTRODUCTION 

Many  applications  in  science  and  engineering  give  rise  to  symmetric  eigenvalue  prob¬ 
lems  of  the  form 

Ax  =  \Bx  (1.1) 

where  the  matrices  A  and  B  are  sparse  and  often  quite  large.  We  seek  the  smallest 
magnitude  eigenvalues  of  a  given  matrix  pencil  (R,  B)  along  with  their  associated 
eigenvectors.  Computing  the  smallest  eigenvalues  is  more  difficult  than  computing 
the  largest,  because  it  often  necessitates  the  accurate  solution  of  linear  systems  at 
each  iteration.  This  can  be  problematic  for  direct  solvers  when  the  matrices  are 
large,  because  the  level  of  hll-in  may  be  too  large  for  such  factorizations  to  be  pos¬ 
sible.  Alternatively,  they  may  require  the  use  of  strong  preconditioners  with  limited 
scalability.  In  some  applications,  the  matrices  are  not  even  made  explicitly  available, 
which  makes  preconditioning  difficult  and  factorization  impossible.  In  this  disserta¬ 
tion,  I  will  present  several  eigensolvers  that  do  not  rely  on  accurate  linear  solves, 
which  I  will  refer  to  as  trace-minimization  eigensolvers. 

First,  we  will  discuss  a  few  sample  application  areas  that  give  rise  to  sparse  sym¬ 
metric  eigenvalue  problems.  One  application  area  is  the  modeling  of  acoustic  helds 
in  moving  vehicles,  which  is  governed  by  the  lossless  wave  equation.  By  applying  a 
hnite  element  discretization,  we  obtain  a  generalized  eigenvalue  problem  where  both 
the  stiffness  and  mass  matrices  are  ill-conditioned.  We  are  interested  in  computing 
all  eigenvalues  in  a  given  interval  along  with  their  associated  eigenvectors. 

Another  problem  of  interest  is  the  Anderson  model  of  localization,  which  models 
electron  transport  in  a  random  lattice.  To  examine  this  behavior,  we  must  solve  the 
time-independent  Schrodinger  equation,  a  standard  eigenvalue  problem.  The  eigen¬ 
values  of  that  matrix  represent  potential  energy,  and  the  eigenvectors  give  us  the 
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probability  of  an  electron  residing  at  a  particular  site;  we  are  interested  in  the  low¬ 
est  potential  energies,  meaning  the  eigenvalues  closest  to  0.  If  the  magnitude  of  each 
element  of  the  eigenvector  is  approximately  equal,  then  the  material  conducts.  Other¬ 
wise,  the  material  does  not.  This  problem  is  difficult  because  the  desired  eigenvalues 
are  interior,  which  are  notably  harder  to  obtain  than  extreme  eigenvalues. 

The  last  application  area  I  will  disucss  is  spectral  reordering.  Unweighted  band¬ 
width  reducing  reorderings  are  important  because  they  can  reduce  the  cost  of  parallel 
matrix-vector  multiplications  by  bringing  the  elements  of  the  matrix  toward  the  diag¬ 
onal,  resulting  in  less  communication  between  MPI  processes.  Weighted  reorderings 
can  be  useful  in  constructing  banded  preconditioners,  because  they  bring  the  large 
elements  of  a  matrix  toward  the  diagonal.  To  compute  the  Fiedler  vector  for  spectral 
reordering,  we  must  solve  a  standard  eigenvalue  problem  where  A  is  symmetric  posi¬ 
tive  semidehnite,  and  the  null  space  of  A  is  known.  This  problem  can  be  difficult  for 
some  eigensolvers  because  A  is  singular. 

After  providing  motivation  for  the  development  of  scalable  sparse  symmetric  eigen¬ 
solvers,  I  will  discuss  an  important  kernel  in  the  trace-minimization  eigensolvers:  the 
solution  of  saddle  point  problems  of  the  form 


A 

BY 

A 

AY 

Y^B 

0 

L 

0 

where  W  is  a  tall  dense  matrix  with  a  very  small  number  of  columns.  I  will  present 
three  types  of  methods  for  solving  that  linear  system,  then  discuss  under  which  cir¬ 
cumstances  each  should  be  used. 

One  way  to  solve  this  problem  is  by  using  a  projected  Krylov  method  to  solve  the 
equivalent  linear  system 

PAPA  =  PAY  (1.3) 


where 


P  =  I  -  BY  (Y^B^Y)  ^Y^B 


(1.4) 
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projects  onto  the  space  orthogonal  to  BY .  Another  method  of  solving  the  saddle 
point  problem  is  by  forming  the  Schur  complement 

S  =  -Y^BA-^BY  (1.5) 


After  we  obtain  the  Schur  complement  (which  can  be  inexact  if  we  used  a  Krylov 
method  to  determine  A~^BY),  we  may  construct  the  solution  A  =  K  +  A~^BYS~^ . 
The  last  method  we  will  discuss  is  the  use  of  block  preconditioned  Krylov  methods. 
We  may  look  at  our  original  saddle  point  problem  (equation  1.2)  as  a  linear  sys¬ 
tem  sY ^  ^  and  use  a  Krylov  subspace  method  on  the  entire  problem.  We  can 
precondition  this  linear  system  in  a  variety  of  ways.  One  such  preconditioner  is 


M  0 
0  A 


(1.6) 


where  M  is  a  preconditioner  approximating  A,  and  S  =  —Y^BM~^BY. 

After  exploring  how  to  solve  the  saddle  point  problems  arising  at  each  iteration  of 
the  trace  minimization  eigensolvers,  I  will  describe  two  such  solvers:  TraceMin  and 
TraceMin-Davidson.  As  the  name  suggests,  these  eigensolvers  transform  the  problem 
of  computing  the  desired  eigenpairs  into  the  equivalent  constrained  minimization 
problem 

min  trace  (K^AK)  (1.7) 

ytby=i  ^  ’ 

The  solution  to  this  problem  is  the  set  of  eigenvectors  corresponding  to  the  eigenvalues 
of  smallest  magnitude.  At  each  iteration  of  our  trace-minimization  eigensolver,  we 
will  compute  an  update  to  our  current  approximate  eigenvectors  W  such  that 
Afc  Ab  Yk  and 


trace  (^(W  -  ^  (Xk  -  ^k)'j  <  trace  (Tjf  AY/,)  (1.8) 

When  we  solve  the  corresponding  constrained  minimization  problem 

min  trace  ( (W  -  A/,)'^  A  (W  -  A/,) 


(1.9) 
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using  Lagrange  multipliers,  we  end  up  with  the  saddle  point  problem  previously 
discussed.  The  only  difference  between  TraceMin  and  TraceMin-Davidson  is  that 
TraceMin  extracts  its  Ritz  vectors  from  a  subspace  of  constant  dimension,  whereas 
TraceMin-Davidson  uses  expanding  subspaces.  These  algorithms  will  be  explained  in 
detail  in  their  respective  chapters. 

TraceMin  has  global  linear  convergence,  the  rate  of  which  is  based  on  both  the 
distribution  of  eigenvalues  and  the  constant  subspace  dimension  s.  In  the  Trace¬ 
Min  chapter,  I  will  present  small  test  cases  that  show  how  the  behavior  of  TraceMin 
changes  when  you  modify  various  parameters  such  as  the  subspace  dimension  or  tol¬ 
erance  of  the  Krylov  method.  I  will  also  explain  how  the  convergence  rate  can  be 
improved  by  using  dynamic  origin  shifts,  which  are  determined  by  the  Ritz  values  of 
the  matrix  pencil.  I  will  conclude  with  a  discussion  of  the  relationship  between  Trace¬ 
Min  and  simultaneous  iteration.  If  both  methods  solve  the  linear  systems  arising  at 
each  iteration  exactly  (using  a  direct  method),  the  methods  are  equivalent.  However, 
I  will  show  that  TraceMin  is  more  robust  and  tolerates  inexact  solves  with  very  little 
precision  better  than  simultaneous  iteration. 

In  the  TraceMin-Davidson  chapter,  I  will  discuss  how  the  method  differs  from 
TraceMin  through  the  use  of  expanding  subspaces.  I  will  also  present  a  small  exper¬ 
iment  showing  the  effect  of  the  block  size  on  hnding  eigenvalues  with  a  multiplicity 
greater  than  1.  Additionally,  I  will  describe  what  harmonic  Ritz  extraction  is  and 
how  it  can  help  when  computing  interior  eigenpairs. 

After  describing  the  theory  of  these  eigensolvers,  I  will  describe  my  parallel  im¬ 
plementations  of  these  methods  in  solving  different  types  of  problems.  First,  I  will 
discuss  my  publically  available  Trilinos  implementations,  which  are  designed  to  com¬ 
pute  a  small  number  of  eigenpairs  of  smallest  magnitude.  I  will  then  explain  how 
spectral  transformations  can  be  used  to  compute  the  largest  eigenvalues  or  the  eigen¬ 
values  nearest  a  given  shift,  and  how  spectrum  folding  can  allow  eigensolvers  which 
are  designed  for  the  computation  of  extreme  eigenpairs  to  compute  interior  ones  suc¬ 
cessfully.  In  addition,  I  will  describe  the  parallel  kernels  required  by  my  code. 
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The  next  sections  will  describe  my  Fortran-based  implementations  of  sampling 
and  multisectioning.  In  the  case  of  sampling,  we  are  interested  in  computing  the 
eigenvalues  closest  to  a  large  set  of  shifts;  in  multisectioning  (or  spectrum  slicing), 
we  are  interested  in  computing  all  eigenvalues  in  a  given  interval.  This  interval  often 
contains  a  large  number  of  eigenvalues.  I  will  present  a  multisectioning  algorithm 
loosely  based  on  adaptive  quadrature  which  divides  the  large  global  interval  of  interest 
into  many  subintervals  which  can  be  processed  independently.  My  method  performs 
both  the  interval  subdivision  step  and  the  eigensolver  steps  in  parallel  and  features 
dynamic  load  balancing  for  improved  scalability. 

After  we  have  thoroughly  explored  TraceMin  and  TraceMin-Davidson,  I  will  de¬ 
scribe  several  other  eigensolvers  which  will  compete  against  my  implementations  in 
the  numerical  experiments  section.  Arnoldi,  Lanczos,  and  Krylov-Schur  are  very  sim¬ 
ilar  methods,  all  of  which  require  the  accurate  solution  of  linear  systems  at  each 
iteration;  they  are  analogous  to  TraceMin-Davidson,  if  we  use  the  Schur-complement 
method  to  solve  the  saddle  point  problem  at  each  iteration.  The  Locally  Optimal 
Block  Preconditioned  Conjugate  Gradient  method  avoids  solving  linear  systems  en¬ 
tirely,  but  it  can  fail  if  not  given  a  strong  preconditioner.  Jacobi-Davidson  is  theo¬ 
retically  similar  to  TraceMin-Davidson,  except  that  it  uses  a  more  aggressive  shifting 
strategy  which  can  cause  it  to  miss  the  smallest  eigenpairs  or  converge  very  slowly. 
The  Riemannian  Trust  Region  method  is  very  closely  related  to  TraceMin,  but  it  uses 
the  exact  Hessian  in  solving  the  constrained  minimization  problem  whereas  TraceMin 
uses  a  cheap  approximation.  FEAST  is  a  contour  integration  based  eigensolver  which 
requires  both  an  interval  of  interest  and  an  estimate  of  the  number  of  eigenvalues 
that  interval  contains. 

Finally,  I  will  present  comparisons  between  my  methods  and  those  of  the  popu¬ 
lar  eigensolver  packages  Anasazi  (of  Sandia’s  Trilinos  library),  SLEPc,  and  FEAST, 
establishing  the  robustness  and  parallel  scalability  of  TraceMin. 
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2  MOTIVATING  APPLICATIONS 

In  this  section,  I  will  justify  the  need  for  a  robust  and  parallel  sparse  symmetric 
eigenvalue  problem  solver  such  as  TraceMin  by  presenting  several  application  areas 
which  give  rise  to  large  sparse  symmetric  eigenvalue  problems. 


2.1  Automotive  engineering 

Modeling  acoustic  helds  in  moving  vehicles  generally  uses  coupled  systems  of  par¬ 
tial  differential  equations  (PDEs).  The  systems  resulting  from  the  discretization  of 
these  PDEs  tend  to  be  extremely  large  and  ill-conditioned. 

The  lossless  wave  equation  in  air  is  given  by 

(-) 

where  p  represents  the  pressure  and  c  the  speed  of  sound;  for  a  derivation  of  this 
equation,  please  see  [1].  Neumann  boundary  conditions  are  given  by 


Sp 

Su 


r  Sp 


(2.2) 


St 

where  p  represents  the  density,  r  the  damping  properties  of  the  material,  and  u  the 
outer  normal.  We  may  apply  a  hnite  element  discretization  to  obtain  the  following 
equation  for  the  fluid. 


MfPd  + DfPd  + KfPd  + DsfUd  =  (2.3) 

where  Mf  is  a  spd  mass  matrix,  iPj  is  a  spd  stiffness  matrix,  Dj  is  a  spsd  damping 
matrix,  Dgf  is  a  spd  mass  matrix  representing  the  fluid  structure  coupling,  and  u 
represents  the  vector  of  displacements. 
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The  discrete  finite  element  model  for  the  vibration  of  the  structure  is 


Msild  +  Dsiid  +  KsUd  -  D^jpd  =  fe 


(2.4) 


with  Mg  and  Kg  spd,  Dg  spsd,  and  fe  the  external  load.  If  we  combine  equations  2.3 
and  2.4,  we  obtain 


Mg 

0 

Ud 

+ 

Dg 

0 

Ud 

+ 

Dgf 

M/_ 

Pd 

0 

Pd 

Kg{u)  -D^ 

)  Kf 


1 - 

1 _ 

Pd 

0 

(2.5) 


We  then  perform  a  Fourier  ansatz 

Ud 
Pd 


e^^\fs  =  fe 


iujt 


(2.6) 


to  obtain  the  following  frequency  dependent  linear  system 

Kgiu)  -Djf 
0  Kf 

We  may  also  write  this  as  a  symmetric  problem  for  nonzero  frequencies. 


(  . 

Mg 

0 

Ds 

0 

r 

Dgf 

Mf_ 

+  iu 

0 

1 

+ 

u  (cj) 

/M 

P(uj) 

0 

(2.7) 


-u 


Mg  0 
0  Mf 


lUJ 


Dg  zDjf 


W 


Sf 


D 


f 


+ 


Kg  {u)  0 

0  Kf 


it  {u) 

/M 

uj~^p  (u) 

0 

(2.8) 


Although  these  systems  have  very  large  dimensions,  we  are  typically  only  inter¬ 
ested  in  the  low  frequencies  associated  with  the  eigenvalues  in  the  neighborhood  of 
zero  of  the  following  symmetric  matrix  function 


Q  (ca)  =  —u‘^M  +  iuD  +  K  (2.9) 

where  iF’s  nonlinear  dependency  on  the  frequency  is  ignored.  In  the  absence  of 
damping,  equation  2.9  gives  rise  to  the  following  generalized  eigenvalue  problem 

Kx  =  \Mx  (2.10) 

where,  in  exact  arithmetic  M  and  K  are  spd,  but  M  is  singular  to  working  preci¬ 
sion  in  floating-point  arithmetic  due  to  the  fact  that  rotational  masses  are  omitted  [1] . 
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Figure  2.1.:  Behavior  of  an  externalized  (left)  and  localized  (right)  wavefunction  for 
the  three-dimensional  Anderson  model  with  periodic  boundary  conditions  aX  E  =  Q 
with  N  =  100^  and  hF  =  12  and  21  respectively  [4] 


2.2  Condensed  matter  physics 

In  1958,  P.W.  Anderson  proposed  a  model  for  electron  transport  in  a  random 
lattice  [2].  Although  it  was  later  discovered  that  Anderson  localization  may  occur  for 
any  wave  propagating  through  a  disordered  medium  [3],  we  focus  on  the  model  as  it 
applies  to  conductivity. 

In  this  model,  we  have  an  array  of  sites  called  a  lattice.  These  sites  are  occupied 
by  entities  such  as  atoms.  The  basic  technique  is  to  place  a  single  electron  in  the 
lattice  and  study  the  resulting  behavior  of  the  wave  function.  The  wave  function  tells 
us  the  probability  of  hnding  an  electron  at  a  particular  site.  If  the  probability  of 
hnding  an  electron  at  certain  sites  is  practically  zero,  Anderson  localization  occurs  as 
in  hgure  2.1. 
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To  examine  this  phenomenon,  we  solve  the  time-independent  Schroedinger  equa¬ 
tion 

Em  =  Hm  (2.11) 

with  Hamiltonian 

(h4>)  (])  =  EjHl)  +  i\k  -  j\)  4:  (k)  (2.12) 

The  goal  is  to  find  the  stationary  states  T  with  low  energy  E,  i.e.  to  find  the  eigen¬ 
values  closest  to  0  and  their  associated  eigenvectors. 

The  first  term  of  the  Hamiltonian  accounts  for  the  probability  of  an  electron  at 
a  particular  site  staying  there,  and  it  is  based  on  the  randomly  assigned  energy  at 
that  site,  Ej.  The  second  term  accounts  for  the  probability  of  the  electron  hopping 
to  an  adjacent  site  and  is  based  on  the  interaction  term  V  (r).  For  simplicity,  we  may 
choose 

f  1  if  |r|  =  1 

r(|r|)  =  l  (2.13) 

I  0  otherwise 

This  gives  us  a  matrix  with  the  same  structure  as  the  seven-point  central  difference 
approximation  to  the  three-dimensional  Poisson  equation  on  the  unit  cube  with  pe¬ 
riodic  boundary  conditions. 

The  difficulty  of  this  problem  arises  from  the  random  entries  on  the  diagonal  and 
the  large  cluster  of  eigenvalues  around  0.  Each  Ej  is  taken  from  a  uniform  distribution 
in  [— lH/2,  -l-lH/2]  with  some  W  G  [1,  30],  meaning  the  Hamiltonian  matrix  will  likely 
be  symmetric  indefinite.  This  W  changes  the  behavior  of  the  material  in  the  following 
way: 

•  If  IH  <<  16.5,  the  eigenvectors  are  extended  and  the  material  will  be  a  conduc¬ 
tor. 

•  If  IH  >>  16.5,  all  eigenvectors  are  localized  and  the  material  will  be  an  insulator. 

•  W  =  Wc  =  16.5  is  a  critical  value  where  the  extended  states  around  E  =  0 
vanish  and  no  current  can  flow. 
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To  numerically  distinguish  between  these  three  cases,  we  must  look  at  a  series  of 
many  large  problems  of  order  10®  to  10®  with  various  random  diagonals  [5]. 

2.3  Spectral  reordering 

Parallel  sparse  matrix-vector  multiplications  y  =  Ax  can  be  very  expensive  oper¬ 
ations  due  to  low  data  locality.  The  cost  of  such  operations  is  based  on  the  sparsity 
pattern  of  the  matrix  A.  General  sparse  matrices  can  require  collective  communica¬ 
tion  between  all  MPI  processes,  which  is  very  undesirable  when  running  on  a  large 
number  of  nodes.  However,  if  A  is  banded,  we  only  require  point-to-point  com¬ 
munication  between  nearest  neighbors.  As  a  result,  we  may  wish  to  permute  the 
elements  of  A  to  gain  a  more  favorable  sparsity  pattern,  keeping  in  mind  that  the 
cost  of  computing  this  permutation  will  be  amortized  over  a  large  number  of  matrix- 
vector  multiplications.  One  method  of  obtaining  this  permutation  is  by  computing 
the  Fiedler  vector  [6].  Computing  the  Fiedler  vector  of  an  unweighted  graph  pro¬ 
duces  a  bandwidth-reducing  reordering,  whereas  computing  the  Fiedler  vector  of  a 
weighted  graph  produces  a  reordering  which  brings  large  elements  toward  the  diago¬ 
nal.  Bandwidth-reducing  reorderings  are  meant  to  reduce  the  cost  of  a  matrix-vector 
product,  and  weighted  spectral  reorderings  can  be  part  of  an  effective  preconditioning 
strategy;  after  bringing  the  large  elements  toward  the  diagonal,  we  can  extract  a  band 
from  our  reordered  matrix  to  be  used  as  a  preconditioner.  This  preconditoner  could 
be  applied  by  a  scalable  banded  solver  such  as  SPIKE  [7, 8] . 

In  this  case,  we  are  interested  in  the  eigenvector  corresponding  to  the  smallest 
nonzero  eigenvalue  of  a  standard  eigenvalue  problem  Lx  =  Xx,  where  L  is  the  graph 
Laplacian.  Assuming  D  is  the  degree  matrix  of  A  and  J  is  the  adjacency  matrix, 
L  =  D  —  J.  The  graph  Laplacian  L  is  symmetric  positive  semi-dehnite.  If  the  graph 
consists  of  only  one  strongly  connected  component,  the  graph  Laplacian  A’s  null 
space  is  exactly  one  vector,  the  vector  of  all  Is^.  If  the  graph  consists  of  multiple 
^We  assume  for  simplicity  that  A  was  not  normalized. 
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components,  it  can  be  split  up  into  many  smaller  eigenproblems,  one  per  strongly 
connected  component,  and  these  problems  can  be  solved  independently. 
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3  PARALLEL  SADDLE  POINT  SOLVERS 

The  goal  of  this  chapter  is  the  solution  of  the  following  saddle  point  problem,  which 
arises  in  every  TraceMin  iteration. 


A 

BY 

A 

AY 

Y^B 

0 

L 

0 

A  is  symmetric,  Y  has  many  more  rows  than  columns  and  is  assumed  to  have  full 
column  rank.  We  also  know  that  BY  =  I. 

We  will  examine  several  ways  of  solving  linear  systems  with  this  special  structure 
on  parallel  architectures. 

3.1  Using  a  projected  Krylov  method 

This  is  the  method  originally  used  in  the  1982  implementation  of  a  basic  trace 
minimization  eigensolver  [9].  Solving  the  system  (3.1)  is  equivalent  to  solving  the 
following  linear  system 

PAPA  =  PAY  (3.2) 

where 

P  =  I  -BY  (Y^B'^Y)  Y'^B  (3.3) 

projects  onto  the  space  R-orthogonal  to  Y.  Since  this  linear  system  is  consistent, 
we  can  use  a  Krylov  subspace  method  to  solve  it  (even  though  our  operator  PAP  is 
singular) . 

If  we  choose  our  initial  iterate  Aq  K,  applying  the  symmetric  operator  PAP 
to  a  vector  (or  set  of  vectors)  at  each  iteration  is  equivalent  to  applying  PA.  That 
means  we  can  use  a  symmetric  Krylov  method  such  as  the  conjugate  gradient  method 
or  MINRES  and  still  only  require  one  projection  [10,11]. 
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3.2  Forming  the  Schur  complement 

By  performing  block  Gaussian  elimination  on  equation  3.1,  we  obtain  the  following 
result. 

A  =  F  +  ZS-^  (3.4) 

where  Z  =  A~^BY  and  S  =  —Y^BZ  is  the  Schur  complement.  We  can  solve  AZ  = 
BY  approximately  using  a  Krylov  method  to  obtain  the  inexact  Schur  complement 
S  =  —BY^Z.  Once  we  have  Z  and  S,  we  can  compute  A  using  a  small  dense  solve 
and  a  vector  addition. 


3.3  Block  preconditioned  Krylov  methods 


Another  alternative  is  to  use  a  Krylov  subspace  method  on  the  entire  problem, 
meaning  we  have  the  operator 


A 

Y^B 


BY 

0 


(3.5) 


We  can  precondition  this  operator  in  a  variety  of  ways.  If  we  use  the  preconditioner 


A 

0 


0 


(3.6) 


where  S  =  —Y^BA~^BY  is  the  Schur  complement,  our  preconditioned  matrix 
has  at  most  four  distinct  eigenvalues  [12].  Therefore,  we  would  converge  in  at  most 
four  iterations  of  MINRES  in  exact  arithmetic.  However,  each  iteration  would  involve 
the  accurate  solution  of  linear  systems  involving  A,  which  can  be  very  expensive. 

Alternatively,  we  can  replace  A  by  a  preconditioner  M  to  obtain 


M  0 
0 


(3.7) 


where  S  =  —Y^BM  ^BY.  Since  Y  is  very  narrow,  we  can  compute  the  matrix  S 
explicitly  and  replicate  it  across  all  MPI  processes.  The  application  of  this  precon- 
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ditioner  ^  only  involves  an  application  of  the  preconditioner  M  and  a  small  dense 
solve  with  S. 

Note  that  this  is  not  the  only  possible  preconditioning  strategy  for  this  saddle 
point  problem.  Instead  of  using  that  block  diagonal  preconditioner,  we  could  use  a 
block  triangular  preconditioner  such  as 


M  BY 
0 


(3.8) 


but  that  would  prevent  us  from  using  a  symmetric  solver  such  as  MINRES  to  solve 
linear  systems  with  sY .  We  could  also  choose  to  do  constraint  preconditioning  with 


= 


M  BY 
Y^B  0 


(3,9) 


as  in  [13]. 

Equation  3.1  is  equivalent  to  the  following 


A 

BY 

A 

AY 

-Y^B 

0 

L 

0 

(3.10) 


The  operator  of  (3.1)  is  symmetric  indehnite;  the  operator  of  (3.10)  is  nonsym- 
metric,  but  all  eigenvalues  will  be  on  one  side  of  the  imaginary  axis.  We  could 
use  a  Hermitian/Skew-Hermitian  splitting  based  preconditioner  on  this  problem  as 
in  [14,15]. 


3.4  Which  method  to  choose 

All  of  these  methods  have  been  incorporated  into  Sandia’s  publicly  available  Trace- 
Min  code.  Each  of  the  methods  has  its  own  unique  advantages  and  disadvantages, 
summarized  in  table  3.1. 

In  short,  I  recommend  the  following  strategy: 

•  If  you  want  to  factor  your  matrix  A,  form  the  Schur  complement.  Note  that 
TraceMin  does  not  generally  require  accurate  solutions  of  linear  systems  involv¬ 
ing  A,  so  this  can  be  overkill. 
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Table  3.1:  Comparison  of  saddle  point  solvers 


Projected 

Krylov 

Schur 

complement 

Block 

preconditioning 

Strictly  enforces  the  condition 

ATb  F? 

yes 

no 

no 

Application  of  the  operator  re¬ 
quires  an  inner  product? 

yes 

no 

yes 

Capable  of  using  preconditioned 

MINRES? 

no 

yes 

yes 

Capable  of  using  a  direct  solver? 

no 

yes 

no 

16 


•  If  you  want  to  use  a  preconditioner  M  ^  A,  choose  the  block  diagonal  precon¬ 
ditioning  method.  Again,  since  TraceMin  does  not  require  accurate  solutions  of 
linear  systems  involving  A,  it  should  not  need  a  strong  preconditioner.  Precon¬ 
ditioning  the  projected-Krylov  solver  does  not  generally  perform  well  because  it 
requires  solutions  of  nonsymmetric  linear  systems,  which  are  considerably  more 
expensive  than  the  symmetric  case.  Forming  the  inexact  Schur  complement  is 
another  possibility,  but  it  tends  to  perform  poorly  without  a  strict  tolerance  be¬ 
cause  the  linear  system  being  solved  at  each  iteration  is  completely  disconnected 
from  the  requirement  that  A  Ab  Y. 

•  Otherwise,  use  the  projected  Krylov  method. 
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4  TRACEMIN 

The  generalized  eigenvalue  problem  considered  here  is  given  by 

Ax  =  \Bx  (4.1) 

where  A,  B  are  nxn  very  large,  sparse,  and  symmetric,  with  B  being  positive  dehnite, 
and  one  is  interested  in  obtaining  a  few  eigenvalues  p  n  of  smallest  magnitude  and 
their  associated  eigenvectors. 

4.1  Derivation  of  TraceMin 

TraceMin  is  based  on  the  following  theorem,  which  transforms  the  problem  of 
solving  equation  (4.1)  into  a  constrained  minimization  problem. 

Theorem  4.1.1  [9, 16]  Let  A  and  B  he  symmetric  nxn  matrices  with  B  positive 
definite,  and  Y*  the  set  of  all  n  x  p  matrices  Y  for  which  Y^ BY  =  Ip.  Then 

p 

min  ti {Y^ AY)  =  A*  (4.2) 

i=l 

where  Ai  <  A2  <  ■  ■  ■  Ap  <  Ap+i  <  ■  ■  ■  <  A„  are  the  eigenvalues  of  problem  (4.1). 

The  block  of  vectors  Y  which  solves  the  constrained  minimization  problem  is  the 
set  of  eigenvectors  corresponding  to  the  eigenvalues  of  smallest  magnitude.  I  will  now 
discuss  how  TraceMin  is  derived  from  this  observation. 

Let  T  be  a  set  of  vectors  approximating  the  eigenvectors  of  interest.  At  each 
TraceMin  iteration,  we  wish  to  construct  17+1  from  I7  such  that  tr  (l7^j^A17+i)  < 
tr  (YjfiAYk).  Consequently,  I7+1  is  a  better  approximation  of  the  eigenvectors  than 
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Ifc.  The  iterate  Y^.  is  corrected  using  the  n  x  p  matrix  A^,  which  is  the  solution  of 
the  following  constrained  optimization  problem, 


minimize  tr  (Yk  -  A^.)^  A  (Y^  -  A^) , 
subject  to  Y^fBAk  =  0 


(4.3) 


(4.4) 


If  A  is  symmetric  positive  definite,  this  is  equivalent  to  solving  the  p  independent 
problems 

minimize  {yk,i  -  dk,i)  A  {yk,i  -  dk,i) , 
subject  to  yl^^BAk  =  0 
where  yk,i  is  the  i-th.  column  of  Y^. 

Figure  4.1  presents  a  graphical  illustration  of  how  TraceMin  works  for  the  3x3 
matrix 


(4.5) 


1.67 

-0.33 

-0.33 

-0.33 

2.17 

-0.83 

-0.33 

-0.83 

2.17 

We  seek  the  smallest  eigenpair,  Ai  =  1  with  eigenvector 


Xi  = 


using  an  initial  subspace  of 


X  = 


1 

F3 

1 

F3 

1 

F3 


0 

0 

1 


(4.6) 


(4.7) 


The  colored  plane  represents  the  space  orthogonal  to  our  subspace.  We  would  like  to 
hnd  the  update  vector  d  minimizing  the  quantity  {x  +  d)^  A{x  -\-  d)  in  that  subspace 
(i.e.  x'^d  =  0)  The  light  yellow  oval  denotes  the  area  of  the  subspace  where  that 


^To  make  the  plot  more  intuitive,  I  have  chosen  to  refer  to  our  updated  vector  as  v  =  x  -\-  d  rather 
than  V  =  X  —  d. 
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quantity  is  smallest,  and  the  dark  blue  denotes  the  area  where  the  quantity  is  large. 
The  increment  d  which  minimizes  that  trace  is 


d  = 


0.2857 

0.4286 

0 


(4.8) 


Note  that  x  +  d  is  much  closer  to  our  true  solution  than  our  initial  guess  x.  We  now 
turn  our  attention  to  how  to  solve  this  constrained  minimization  problem  for  larger 
matrices. 

We  can  transform  our  constrained  minimization  problem  to  an  nnconstrained 
minimization  problem  using  Lagrange’s  theorem,  which  leads  to  the  following  saddle 
point  problem  for 


A 

BYk 

AYk 

y^B 

0 

Lk 

0 

where  L^.  represents  the  Lagrange  multipliers.  Alternatively,  we  may  write  the  above 
saddle-point  problem  as 


A 

BY, 

14+1 

0 

y^B 

0 

Lk 

(4.10) 


where  14+1  =  W  ~  and  =  —L^.  Assnming  our  matrix  A  is  symmetric  positive 
dehnite,  we  have  satished  the  second  order  sufficient  conditions  for  optimality;  is 
guaranteed  to  be  the  solution  of  onr  original  constrained  minimization  problem.  If  A 
is  indehnite,  we  have  no  such  guarantee,  but  as  my  results  will  demonstrate,  TraceMin 
is  still  capable  of  computing  the  smallest  eigenpairs.  After  14+i  is  obtained,  we  B- 
ort honor malize  it  and  use  the  Rayleigh- Ritz  procedure  to  generate  W+i.  This  process 
is  snmmarized  in  Algorithm  1. 

Now  that  the  TraceMin  algorithm  has  been  presented,  we  now  turn  onr  attention 
to  its  convergence  properties  and  some  implementation  details. 
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Space  orthogonal  to  x 

X 


d 


x+d 

True  solution 


Figure  4.1.:  Graphical  demonstration  of  the  TraceMin  algorithm 
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Algorithm  1  TraceMin  algorithm 
Require:  Subspace  dimension  s  >  p, 

Vi  e  with  rank  s, 

A  and  B  symmetric,  with  B  also  positive  dehnite 
1:  for  /c  =  1  — )■  maxit  do 
2:  B-orthonormalize  14 

3:  Perform  the  Rayleigh-Ritz  procedure  to  obtain  the  approximate  eigenpairs 

{AYu  ^  BYkQk)-. 

•  Form  Hk  =  Vi[ AVk 

•  Compute  all  eigenpairs  of  Hk,  HkXk  =  X^Qk 

•  Compute  the  Ritz  vectors  14  =  14^^ 

4:  Compute  the  residual  vectors  Rk  =  AYk  —  BYkQk 

5:  Test  for  convergence 

6:  Solve  the  saddle  point  problem  (4.10)  approximately  to  obtain  I4+1 

7:  end  for 
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4.2  Convergence  rate 

TraceMin  is  globally  convergent,  and  if  Uk^i  is  the  ith  column  of  1^,  the  column 
yk,ii  converges  to  the  eigenvector  Xi  corresponding  to  for  i  =  1,  2,  •  ■  ■  ,p  with  an 
asymptotic  rate  of  convergence  bounded  by  Ai/A^+i,  where  s  is  the  subspace  dimension 
(or  number  of  vectors  in  Y).  As  a  result,  eigenvalues  located  closer  to  the  origin  will 
converge  considerably  faster  than  ones  near  A^+i. 

We  now  turn  our  attention  to  a  synthetic  test  matrix  which  demonstrates  this 
convergence  rate  in  practice.  Our  synthetic  test  matrix  is  order  100,  with  eigenval¬ 
ues  (0.01,  0.1,  0.5,  0.904,  0.905,.. .,0.999,1).  We  will  run  TraceMin  with  a  subspace 
dimension  of  four  vectors  and  examine  how  long  it  takes  each  vector  to  converge  to 
an  absolute  residual  ||r  =  Ay  —  0y\\2  <  10“®.  Note  that  the  hrst  vector  will  have  a 
convergence  rate  of  ~  0.011,  and  the  last  vector  will  have  a  convergence  rate  of 
^"1^  ~  0.999.  That  means  TraceMin  should  take  less  than  ten  iterations  to  compute 
the  first  (smallest)  eigenpair,  but  it  will  take  thousands  to  compute  the  fourth.  Fig¬ 
ures  4.2a  and  4.2b  show  the  absolute  residual  ||rj  =  Ayi  —  9iyi\\2  and  absolute  error 
Ci  =  —  Ail  measured  across  50  TraceMin  iterations. 

In  practice,  we  generally  can  not  measure  the  error,  as  we  do  not  know  the  eigen¬ 
values  of  interest;  we  can  only  measure  the  residual.  It  is  important  to  note  that  while 
the  error  decreases  monotonically,  the  residual  may  not.  In  this  case,  the  initial  Ritz 
values  (the  approximate  eigenvalues)  are  in  the  range  (0.9,0.96).  These  Ritz  values 
are  very  close  to  true  eigenvalues,  but  those  are  not  the  eigenvalues  we  seek.  The 
residual  only  tells  us  whether  a  given  Ritz  pair  approximates  some  eigenpair,  not 
whether  it  approximates  the  one  we  want.  This  is  why  the  residual  appears  to  spike 
in  Figure  4.2a,  while  the  error  decreases  monotonically  in  Figure  4.2b. 

Since  we  have  so  far  only  discussed  the  subspace  dimension  as  some  constant, 
we  will  now  turn  our  attention  to  its  impact  on  convergence  and  how  it  should  be 
chosen.  Later,  we  will  also  examine  how  to  improve  the  convergence  rate  of  TraceMin 
via  shifting. 
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(a)  Absolute  residual  for  each  eigenvalue 


(b)  Absolute  error  for  each  eigenvalue 


Figure  4.2.:  Demonstration  of  TraceMin’s  convergence  rate 
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4.3  Choice  of  the  subspace  dimension 

TraceMin  uses  a  constant  subspace  dimension  s,  where  s  is  the  number  of  vectors 
in  V.  The  choice  of  this  subspace  dimension  s  is  very  important.  Larger  subspace 
dimensions  may  cut  down  on  the  number  of  TraceMin  iterations  required  (because 
the  convergence  rate  Aj/A^+i  improves),  but  each  iteration  then  involves  more  work. 
Small  subspace  dimensions  reduce  the  amount  of  work  done  per  TraceMin  iteration 
but  result  in  a  worse  convergence  rate.  To  demonstrate  the  effect  of  the  subspace 
dimension  on  overall  work,  I  will  now  present  an  example  of  what  happens  when  we 
vary  the  subspace  dimension. 

This  synthetic  test  matrix  is  order  100,  with  eigenvalues 

(0.1,  0.11, ... ,  0.16, 0.17,  0.909,  0.91, . . . ,  0.999, 1) 

We  will  run  TraceMin  with  a  subspace  dimension  of  four  vectors,  then  eight  vectors, 
and  hnally  with  twelve  vectors  and  examine  how  many  iterations  it  takes  TraceMin  to 
converge.  Figure  4.3  shows  that  increasing  the  block  size  for  this  matrix  decreased  the 
number  of  required  TraceMin  iterations.  Increasing  from  s  =  4  to  s  =  8  had  a  greater 
impact  than  increasing  from  s  =  8  to  s  =  12  since  this  problem  featured  a  large  gap 
between  the  eighth  and  ninth  eigenvalues.  For  this  problem,  it’s  clear  that  a  subspace 
dimension  of  s  =  8  is  optimal,  given  the  eigenvalue  distribution.  We  generally  can  not 
determine  the  optimal  subspace  dimension,  since  we  do  not  have  enough  information 
about  the  spectrum.  In  practice,  s  =  2p  tends  to  work  well  (where  p  is  the  desired 
number  of  eigenvalues). 

For  the  tiny  examples  I  have  presented  so  far,  the  saddle  point  problems  were 
solved  directly.  When  the  matrices  become  larger,  this  may  be  unreasonable.  We  will 
now  explore  the  effect  of  using  an  iterative  method  to  solve  the  saddle  point  problem. 


Absolute  residual 
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Figure  4.3.:  Demonstration  of  the  importance  of  the  block  size.  This  hgure  presents 
the  residual  of  the  fourth  eigenvalue  vs  number  of  TraceMin  iterations 
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4.4  TraceMin  as  a  nested  iterative  method 

In  TraceMin,  the  saddle  point  problem  does  not  need  to  be  solved  to  a  high  degree 
of  accuracy  to  preserve  its  global  convergence,  e.g.  see  [9],  and  [17].  Hence,  one  can 
use  an  iterative  method  with  a  modest  relative  residual  as  a  stopping  criterion.  We 
will  later  compare  the  various  saddle  point  solvers  presented  in  the  previous  chapter, 
but  for  now  we  concentrate  on  the  selection  of  the  inner  (Krylov)  tolerance.  To 
demonstrate  the  impact  of  the  inner  tolerance  on  the  convergence  of  TraceMin,  we  will 
study  two  synthetic  examples  in  which  we  seek  the  smallest  eigenpair  with  a  subspace 
dimension  of  one  vector  We  converge  when  the  relative  residual  ||rj||2  /K  <  10“^. 
One  of  these  examples  will  involve  a  matrix  with  poorly  separated  eigenvalues,  and 
the  other  involves  a  matrix  with  well  separated  eigenvalues. 

The  hrst  synthetic  example  is  a  100x100  matrix  with  a  condition  number  of  ap¬ 
proximately  200.  Its  two  smallest  eigenvalues  are  4.29e-2  and  4.34e-2.  Note  that 
these  eigenvalues  are  clustered,  so  TraceMin  will  take  many  iterations  to  converge, 
regardless  of  how  accurately  we  solve  the  saddle  point  problem.  First,  we  will  use 
a  direct  solver  to  provide  a  lower  bound  on  the  number  of  TraceMin  iterations  re¬ 
quired,  then  we  will  try  projected-CG  with  various  tolerances.  In  Figure  4.4a,  we 
see  that  it  took  TraceMin  roughly  180  iterations  to  converge,  regardless  of  whether 
we  used  a  direct  solve  or  an  iterative  method  with  a  moderate  tolerance.  If  we  re¬ 
quire  a  modest  residual  in  the  linear  solve  (a  tolerance  of  0.5),  it  only  takes  20  more 
TraceMin  iterations  than  if  we  had  used  a  direct  solve.  Figure  4.4b  shows  that  the 
overall  work  required  by  TraceMin  with  an  inaccurate  Krylov  solver  is  far  less  than 
with  the  stricter  tolerances.  This  example  demonstrates  that  it  does  not  matter  how 
accurately  we  solve  the  saddle  point  problem  if  TraceMin’s  convergence  rate  is  poor 
due  to  clustered  eigenvalues  and  too  small  a  block  size  s. 

The  second  synthetic  example  is  a  100x100  matrix  with  a  condition  number  of 
approximately  2e7.  Its  two  smallest  eigenvalues  are  3.58e-7  and  3.58e-3.  Note  that 

^In  general,  it  is  a  poor  decision  to  use  such  a  small  subspace  dimension. 
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(a)  Residual  of  the  smallest  eigenvalue  vs  number  of  TraceMin 
iterations 


(b)  Residual  of  the  smallest  eigenvalue  vs  number  of  projected- 
CG  iterations 


Figure  4.4.:  Demonstration  of  the  importance  of  the  inner  Krylov  tolerance  for  a 
problem  with  poorly  separated  eigenvalues 


these  are  well  separated  eigenvalues,  so  TraceMin,  with  a  block  size  s  =  1,  will 
converge  in  a  small  number  of  iterations  if  a  direct  solver  is  used.  Unlike  the  previous 
example,  Figure  4.5a  shows  there  is  a  dramatic  difference  in  the  number  of  TraceMin 
iterations  based  on  the  inner  projected-CG  tolerance.  If  we  use  a  relatively  strict 
tolerance  of  le-4,  TraceMin  converges  in  only  four  iterations;  with  a  tolerance  of  0.5, 
TraceMin  takes  25  iterations  to  converge.  When  we  examine  the  number  of  projected- 
CG  iterations  in  Figure  4.5b,  we  see  that  in  this  case,  it  was  more  efficient  to  use  a 
stricter  tolerance  because  of  the  separation  of  eigenvalues. 

The  moral  is,  the  more  clustered  the  eigenvalues  are,  the  less  important  it  is  to 
solve  the  linear  systems  accurately.  However,  we  generally  know  very  little  about  the 
clustering  of  the  eigenvalues  prior  to  running  TraceMin.  We  can  attempt  to  estimate 
the  convergence  rate  of  each  eigenpair  by  using  the  Ritz  values,  but  the  Ritz  values 
tend  to  be  very  poor  estimates  of  the  eigenvalues  for  at  least  the  first  few  TraceMin 
iterations.  In  my  TraceMin  implementation,  I  compensate  for  this  by  choosing  the 
tolerance  based  on  both  the  Ritz  values  and  the  current  TraceMin  iteration.  The 
exact  equation  I  use  is  as  follows 

to/i  =  min  (4.11) 

where  i  is  the  index  of  the  targetted  right  hand  side,  j  is  the  current  TraceMin 
iteration  number,  and  6  are  the  current  Ritz  values.  Since  this  expression  does  not 
make  sense  for  i  =  s,  I  choose  tolg  =  tolg-i-  I  also  specify  a  maximum  number  of 
Krylov  iterations  to  be  performed,  since  TraceMin  does  not  rely  on  accurate  linear 
solves  to  converge. 

4.5  Deflation  of  converged  eigenvectors 

So  far,  we  have  not  discussed  what  to  do  when  an  eigenpair  converges.  We  would 
like  to  remove  it  from  our  subspace  V  so  that  we  do  not  continue  to  do  unnecessary 
work  improving  a  vector  which  has  already  converged.  However,  we  need  to  ensure 
that  after  we  remove  a  converged  vector  from  the  subspace,  the  subspace  stays  B- 
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(a)  Residual  of  the  smallest  eigenvalue  vs  number  of  TraceMin 
iterations 


(b)  Residual  of  the  smallest  eigenvalue  vs  number  of  projected- 
CG  iterations 


Figure  4.5.:  Demonstration  of  the  importance  of  the  inner  Krylov  tolerance  for  a 
problem  with  well  separated  eigenvalues 
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orthogonal  to  it,  or  else  we  will  converge  to  the  same  vector  over  and  over  again.  If 
C  is  our  set  of  converged  eigenvectors,  the  projector 

P  =  I  -  BC  {C^B^Cy^  C^B  (4.12) 

applied  to  our  subspace  V  will  preserve  that  condition  by  forcing  PV  -Lb  C.  This 
process  of  projecting  the  converged  vectors  from  the  subspace  is  called  deflation^.  If 
we  add  this  feature  to  Algorithm  1,  we  end  up  with  Algorithm  2.  The  few  steps  this 
adds  to  the  TraceMin  iterations  have  been  highlighted  in  red. 

After  a  Ritz  vector  converges,  we  may  either  remove  it  from  the  subspace  and 
continue  to  work  with  a  smaller  subspace  of  dimension  s  — 1,  or  we  may  replace  it  with 
a  random  vector.  If  we  do  not  replace  the  converged  vector,  our  linear  systems  have 
one  fewer  right  hand  side,  and  TraceMin  will  require  less  work  per  iteration.  However, 
if  we  replace  the  converged  vector  with  a  random  one,  the  convergence  rate  for  the 
nonconverged  Ritz  vectors  will  improve  and  we  will  require  fewer  TraceMin  iterations 
overall.  In  my  implementation,  I  replaced  the  converged  vectors  with  random  ones 
and  held  the  subspace  dimension  constant. 

4.6  Ritz  shifts 

The  convergence  rate  of  TraceMin  is  based  on  the  location  of  the  eigenvalues  of 
interest  within  the  spectrum.  As  we  have  seen,  if  they  are  far  from  the  origin,  the  rate 
of  convergence  is  very  poor.  Therefore,  it  can  be  worthwhile  to  perform  a  shift  which 
moves  the  desired  eigenvalues  closer  to  the  origin.  Instead  of  solving  our  original 
problem  Ax  =  \Bx,  we  will  solve  the  problem  {A  —  uB)  x  =  {X  —  u)  Bx,  where  u  is 
our  shift.  The  convergence  rate  for  eigenpair  i  is  now 

Xj-u 

As+i  — 


(4.14) 


^The  Trilinos  documentation  refers  to  this  as  locking,  but  it  is  the  same  concept. 
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Algorithm  2  TraceMin  algorithm  (with  deflation) 

Require:  Subspace  dimension  s  >  p, 

Vi  e  with  rank  s, 

A  and  B  symmetric,  with  B  also  positive  deflnite 
1:  for  fc  =  1  — )■  maxit  do 
2:  if  C  is  not  empty  then 

3:  Perform  the  projection  14  =  PVk,  where  P  =  I  —  BC  {C^B‘^C')  ^  C^B 

4:  end  if 

5:  B-orthonormalize  14 

6:  Perform  the  Rayleigh-Ritz  procedure  to  obtain  the  approximate  eigenpairs 

{AYk  ^  BYkQk) 

7:  Form  Hk  =  Pf  414 

8:  Compute  all  eigenpairs  of  Hk,  HkXk  =  XkQk 

9:  Compute  the  Ritz  vectors  14  =  VkXk 

10:  Compute  the  residual  vectors  Rk  =  AYk  —  BYkQk 

11:  Test  for  convergence 

12:  Move  converged  vectors  from  R  to  C* 

13:  Solve  the  following  saddle  point  problem  approximately  to  obtain  Vk+i 


A  BYk  BC 

1 - 

Y^B  0  0 

Cl) 

^k 

= 

0 

C^B  0  0 

T  (2) 

. 

0 

14:  end  for 
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rather  than  Aj/A^+i.  If  a;  ~  Aj,  eigenpair  i  will  converge  very  quickly.  The  only 
change  these  shifts  necessitate  in  the  TraceMin  algorithm  is  that  the  saddle  point 
problem  of  Equation  4.9  becomes 

{A-uB)  BYk 
BY^  0 

The  matrix  A  —  uB  may  be  formed  explicitly,  or  it  may  be  applied  implicitly 

1  will  now  present  a  small  synthetic  test  problem  demonstrating  the  effect  of  these 
shifts.  Suppose  we  wish  to  hnd  the  four  smallest  eigenpairs  of  a  test  matrix  with 
an  absolute  residual  of  10“®.  This  test  matrix  has  1000  rows,  and  its  eigenvalues  he 
evenly  spaced  in  the  interval  [0.91, 10.9].  We  will  run  TraceMin  twice  using  a  subspace 
dimension  of  nine  vectors.  The  first  time,  we  will  use  the  original  matrix  without  a 
shift,  and  then  we  will  try  TraceMin  with  a  shift  of  0.9.  Note  that  0.9  is  a  close 
approximatiion  of  the  smallest  eigenvalue. 

The  original  matrix  has  an  unfavorable  eigenvalue  distribution  (Figure  4.6);  the 
eigenvalues  we  seek  are  very  far  from  the  origin  and  close  to  Aio  =  1,  i.e.  the  con¬ 
vergence  rate  is  practically  1.  The  shifted  matrix  exhibits  a  much  better  eigenvalue 
distribution.  Some  of  the  eigenvalues  are  still  very  far  from  the  origin,  but  the  four 
targetted  eigenpairs  are  much  closer.  We  see  from  hgure  4.7  that  it  takes  roughly 
180  iterations  of  TraceMin  to  solve  the  problem  without  shifting,  but  it  takes  only 
12  iterations  to  solve  the  problem  with  the  shift  because  we  improved  the  eigenvalue 
distribution. 

4.6.1  Multiple  Ritz  shifts 

In  the  previous  example,  we  used  a  single  shift  for  all  of  the  Ritz  pairs.  This 
improved  the  convergence  rate  of  all  eigenpairs,  but  it  had  the  greatest  effect  on  the 
smallest  one  (since  the  shift  so  closely  approximated  the  smallest  eigenvalue).  Instead 


Afc+i 

{A  -  uB)  Yk 

Lk 

0 

(4.15) 


^In  my  Trilinos  implementation,  A  —  ojB  is  applied  implicitly.  I  did  this  to  accomodate  for  the  case 
where  A  and  B  are  not  available  explicitly. 


33 


(a)  Original  eigenvalue  distribution 
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(b)  Shifted  eigenvalue  distribution 


Figure  4.6.:  The  effect  of  Ritz  shifts  on  the  eigenvalue  spectrum 
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(a)  Original  convergence  rate 


(b)  Improved  convergence  rate 


Figure  4.7.:  The  effect  of  Ritz  shifts  on  convergence 
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of  using  a  single  shift,  we  could  use  separate  shifts  for  each  of  the  Ritz  pairs.  That 
would  result  in  solving  s  saddle  point  problems  per  TraceMin  iteration  of  the  form^ 

(A-ojiB)  BY 
BY^  0 

Note  that  these  saddle  point  problems  do  not  need  to  be  solved  separately.  We  may 
use  a  pseudo-block  Krylov  method  to  solve  these  linear  systems,  but  not  a  block 
Krylov  method®. 

If  each  shift  closely  approximates  the  corresponding  eigenvalue,  the  convergence 
rate  of  every  eigenpair  would  be  greatly  improved,  rather  than  just  the  convergence 
rate  of  the  smallest.  In  the  following  example,  we  will  see  how  the  use  of  multiple 
shifts  impacts  the  convergence  rate  of  TraceMin. 

R  is  a  synthetic  test  matrix  of  order  n  =  100  whose  eigenvalues  he  evenly  spaced 
in  the  interval  [0.91,10.9].  We  are  looking  for  the  four  smallest  eigenpairs  using  a 
subspace  of  dimension  9,  and  we  want  an  absolute  residual  of  le-5.  We  will  try 
TraceMin  with  no  shifts,  with  a  single  shift  of  0.9,  and  with  multiple  shifts  (0.9,  1.0, 
1.1,  1.2,  1.3,  1.4,  1.5,  1.6,  1.7).  Note  that  for  each  Ritz  shift  Wj,  Ui  ~  A*.  Figure  4.8 
shows  that  without  shifts,  TraceMin  will  not  converge  very  quickly  for  this  problem. 
If  we  use  a  single  shift  of  0.9,  the  smallest  eigenvalue  will  converge  quickly,  but  the 
others  will  take  much  longer.  If  we  use  multiple  shifts,  each  of  the  desired  eigenvalues 
should  converge  in  only  a  few  iterations. 

Figure  4.9  shows  us  that  the  use  of  multiple  shifts  reduces  the  trace  of  Y^AY 
much  faster  than  using  only  one  shift,  or  no  shifts  at  all.  Figure  4.10  shows  that  the 
use  of  multiple  shifts  also  lowers  the  residual  of  each  of  the  four  smallest  Ritz  pairs 
much  faster  than  the  other  shifting  strategies.  With  the  multiple  shifts,  it  took  only 
hve  iterations  for  TraceMin  to  hnd  the  four  smallest  eigenpairs.  Using  a  single  shift 

have  dropped  the  TraceMin  iteration  subscript  k  for  clarity. 

®Pseudo-block  Krylov  methods  are  mathematically  equivalent  to  solving  each  linear  system  indepen¬ 
dently.  The  only  difference  is  that  in  an  MPI  program,  several  messages  may  be  grouped  together, 
resulting  in  a  lower  communication  cost.  Block  Krylov  methods  build  one  subspace  which  is  used 
for  all  right  hand  sides,  rather  than  handling  each  independently.  It  would  not  make  sense  to  use  a 
block  Krylov  method  with  these  saddle  point  problems. 


di 

h 

{A  -  UiB)  yi 
0 


(4.16) 
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Eigenvalue  number 


Figure  4.8.:  The  effect  of  multiple  Ritz  shifts  on  TraceMin’s  convergence  rate 
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Figure  4.9.:  The  effect  of  multiple  Ritz  shifts  on  the  trace  reduction 


resulted  in  convergence  after  eleven  TraceMin  iterations.  Without  shifts,  TraceMin 
required  30  iterations  to  converge. 

We  will  now  consider  how  to  choose  the  optimal  shift  based  on  nothing  but  the 
Ritz  values  (approximate  eigenvalues)  and  their  corresponding  residuals. 

4.6.2  Choice  of  the  Ritz  shifts 

Choosing  how  and  when  to  shift  is  a  difficult  issue.  If  we  shift  too  aggressively, 
we  run  the  risk  of  converging  to  a  completely  different  set  of  eigenpairs  than  the  ones 
we  seek,  and  global  convergence  is  destroyed.  Shifting  very  conservatively  avoids  that 
problem,  but  it  is  detrimental  to  the  overall  running  time  of  the  program  because  we 
performed  many  unnecessary  TraceMin  iterations.  In  my  TraceMin  implementation, 
I  allow  the  user  to  choose  just  how  aggressive  he  wishes  to  be  with  the  shifts,  although 
the  default  options  tend  to  work  well.  The  user  can  choose  to  shift  at  every  iteration. 
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TraceMin  without  shifts 


TraceMin  with  muitipie  shifts 


Figure  4.10.:  The  effect  of  multiple  Ritz  shifts  on  convergence 
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after  the  trace  has  leveled  (i.e.  when  the  relative  change  in  trace  between  successive 
iterations  has  become  smaller  than  a  user  dehned  tolerance),  or  he  can  choose  to 
disable  shifting  entirely.  The  user  may  choose  the  shifts  as  being  equal  to  the  largest 
converged  eigenvalue,  the  adjusted  Ritz  values  (which  are  essentially  computed  as 
6i  —  ||rj||2  and  are  described  in  Algorithm  3),  or  the  current  Ritz  values.  He  may 
also  choose  whether  to  use  a  single  Ritz  shift  or  separate  ones  for  each  Ritz  pair.  My 
default  method  of  shifting  is  presented  in  Algorithm  3,  which  is  largely  based  on  the 
work  of  [17]. 

4.7  Relationship  between  TraceMin  and  simultaneous  iteration 

The  method  of  simultaneous  iteration  (Algorithm  4)  was  developed  by  Friedrich 
Bauer  in  1957  under  the  name  Treppeniteration  [18].  Note  that  this  method  is  math¬ 
ematically  equivalent  to  TraceMin,  if  we  solve  the  saddle  point  problem  by  computing 
the  Schur  complement  The  difference  between  the  original  1982  TraceMin  algo¬ 
rithm  and  simultaneous  iteration  is  that  TraceMin  (using  a  projected  Krylov  method 
to  solve  the  saddle  point  problem)  enforces  a  condition  that  A*,,  the  update  to  17, 
must  be  R-orthogonal  to  the  current  Ritz  vectors  17,  whereas  simultaneous  iteration 
only  enforces  that  condition  if  the  linear  systems  AV  =  BY  are  solved  to  a  high 
degree  of  precision.  As  a  result,  TraceMin  tends  to  perform  better  than  simultaneous 
iteration  when  the  linear  systems  are  solved  inexactly.  We  now  look  at  an  example 
comparing  the  two  methods. 

Our  test  matrix  A  is  a  synthetic  test  matrix  of  order  100,  with  a  condition  number 

of  1000.  We  seek  the  four  smallest  eigenpairs  with  an  absolute  residual  of  le-6,  using 

a  subspace  dimension  of  eight.  We  will  try  TraceMin,  using  projected-CG  to  solve  the 

saddle  point  problem,  and  simultaneous  iteration,  with  CG  to  solve  AV7+i  =  517- 

Figure  4.11  shows  what  happens  if  we  use  a  very  modest  tolerance  of  0.5  when  solving 

the  linear  systems  arising  in  each  iteration.  TraceMin  converges  in  roughly  30  itera- 

^If  Zk  «  A~^BYk,  TraceMin  computes  =  Zk  (Y^BZk)  whereas  the  method  of  simultaneous 
iteration  computes  =  Zk-  and  span  the  same  subspace. 
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Algorithm  3  Default  shift-selection  algorithm 
Require:  Subspace  dimension  s 

Computed  residual  R  =  AY  —  BY Q 
1:  Determine  whether  the  Ritz  values  are  clustered. 

Ritz  values  9i  and  are  in  a  cluster  if  9i  +  ||ri||2  >  ^j+i  ~  IID+1II2 
2:  For  each  cluster,  compute  the  residual  norm  of  that  cluster. 

If  9i,  9j,  and  9k  are  in  a  cluster,  the  residual  norm  of  that  cluster  is  I3ij^k  = 

IIDII2  +  IIDII2  +  Il’"*:ll2 

3:  if  at  least  one  eigenvalue  has  converged  then 
4:  ui  =  the  largest  converged  eigenvalue 

5:  else 

6:  Wi  =  0 

7:  end  if 

8:  if  9i  is  not  in  a  cluster  with  02  then 
9:  cui  =  max  (cui,  0i) 

10:  else 

11:  (Ui  =  max  (cui,  01  — /3i) 

12:  end  if 

13:  for  /c  =  2— )-s  —  Ido 

14:  if  uJk-i  =  0fc-i  and  0^  is  not  in  a  cluster  with  9k+i  then 

15:  CUk  —  9k 

16:  else  if  there  exists  a  0^  such  that  9i  <  9k  —  ||’"fc||2  then 

17:  ojk  =  the  largest  0j  satisfying  that  condition 

18:  else 

19:  Uk  =  UJk-l 

20:  end  if 

21:  end  for 

22:  OJs  =  OJs-1 
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Algorithm  4  Simultaneous  iteration  algorithm 
Require:  Subspace  dimension  s  >  p, 

Vi  e  with  rank  s, 

A  and  B  symmetric,  with  B  also  positive  dehnite 
1:  for  /c  =  1  — )■  maxit  do 
2:  B-orthonormalize  14 

3:  Perform  the  Rayleigh-Ritz  procedure  to  obtain  the  approximate  eigenpairs 

{AYu  ^  BYkQk) 

4:  Form  Rfc  =  AVk 

5:  Compute  all  eigenpairs  of  Hk,  HkXk  =  XkQk 

6:  Compute  the  Ritz  vectors  14  =  14^^ 

7:  Compute  the  residual  vectors  Rk  =  AYk  —  BYkQk 

8:  Test  for  convergence 

9:  Solve  the  set  of  linear  systems  A14+i  =  BYk 

10:  end  for 
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tions,  whereas  simultanous  iteration  fails  to  converge  within  300  iterations.  Clearly 
it  was  a  bad  idea  to  use  such  a  large  inner  tolerance  with  simultaneous  iteration,  so 
we  will  try  it  again  with  a  stricter  inner  tolerance  of  le-3.  Figure  4.12  shows  that 
even  with  that  stricter  inner  tolerance,  simultaneous  iteration  still  failed  to  converge 
within  300  iterations.  Figure  4.13  presents  a  comparison  of  TraceMin  and  simulta¬ 
neous  iteration  with  an  inner  tolerance  of  le-6.  Since  we  solved  the  linear  systems 
resulting  at  each  iteration  with  considerably  more  accuracy,  TraceMin  and  simultane¬ 
ous  iteration  converge  in  the  same  number  of  iterations.  We  also  see  that  they  both 
required  roughly  the  same  number  of  conjugate  gradient  iterations,  which  means  si¬ 
multaneous  iteration  is  the  faster  algorithm  in  this  case®.  However,  if  we  examine 
the  total  number  of  CG  iterations  required,  the  overall  best  approach  to  solving  this 
problem  was  to  use  TraceMin  with  an  inner  tolerance  of  0.5,  since  that  required  the 
fewest  conjugate  gradient  iterations  overall. 


®Each  conjugate  gradient  iteration  for  simultaneous  iteration  involved  applying  the  operator  A  to  a 
vector.  For  TraceMin,  each  conjugate  gradient  iteration  involved  applying  the  operator  PAP,  which 
requires  at  least  one  projection 
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(a)  Error  in  trace  vs.  number  of  TraceMin/simultaneous  iteration 
iterations 


(b)  Error  in  trace  vs.  number  of  conjugate  gradient  iterations 


Figure  4.11.:  A  comparison  of  TraceMin  and  simultaneous  iterations  using  a  lenient 
inner  tolerance 
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(a)  Error  in  trace  vs.  number  of  TraceMin/simultaneous  iteration 
iterations 


(b)  Error  in  trace  vs.  number  of  conjugate  gradient  iterations 


Figure  4.12.:  A  comparison  of  TraceMin  and  simultaneous  iterations  using  a  moderate 
inner  tolerance 
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(a)  Error  in  trace  vs.  number  of  TraceMin/simultaneous  iteration 
iterations 


(b)  Error  in  trace  vs.  number  of  conjugate  gradient  iterations 

Figure  4.13.:  A  comparison  of  TraceMin  and  simultaneous  iterations  using  a  strict 
inner  tolerance 
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5  TRACEMIN-DAVIDSON 

TraceMin-Davidson  is  an  eigensolver  very  similar  to  TraceMin.  The  only  real  differ¬ 
ence  is  that  while  TraceMin  uses  a  constant  subspace  dimension,  TraceMin-Davidson 
uses  expanding  subspaces.  In  every  iteration,  we  add  a  set  number  of  vectors  to 
our  subspace  V.  When  V  gets  to  be  too  large,  we  shrink  it,  keeping  only  the  most 
important  part  of  the  subspace,  i.e.  the  Ritz  vectors  corresponding  to  the  smallest 
Ritz  values.  Essentially,  TraceMin-Davidson  is  to  TraceMin  as  block-Lanczos  is  to 
simultaneous  iteration.  This  difference  is  outlined  in  Algorithm  5. 

Most  of  the  items  explored  in  the  previous  chapter  still  apply  here.  TraceMin- 
Davidson  converges  faster  if  the  eigenvalues  are  well  separated,  and  we  can  still  use 
Ritz  shifts  to  improve  the  convergence  rate. 

We  will  now  explore  some  of  TraceMin-Davidson’s  implementation  issues. 

5.1  Minimizing  redundant  computations 

At  each  TraceMin-Davidson  iteration,  we  add  s  new  vectors  to  our  subspace  V, 
but  the  rest  of  the  subspace  remains  constant.  In  this  section,  we  will  explore  how  to 
take  advantage  of  that  fact  in  order  to  minimize  the  amount  of  required  computations. 
The  R-orthonormalization  of  14  can  be  simplified  as  follows^: 

•  Project  the  vectors  of  RI4_i  out  of  Afc_i: 

Afc_i  ^  -  BVk-i  Afc_i 

•  R-orthonormalize  A^-i 

^This  simplification  comes  from  the  fact  that  14-1  was  already  B-orthonormalized  in  the  previous 
iteration. 
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Algorithm  5  TraceMin-Davidson  algorithm 
Require:  Block  size  s 

Maximum  subspace  dimension  d  >  2s, 

Vi  e  with  rank  s, 

A  and  B  symmetric,  with  B  also  positive  dehnite 
1:  Initialize  current  subspace  dimension  c  =  s 
2:  for  /c  =  1  — )■  maxit  do 
3:  B-orthonormalize  14 

4:  Perform  the  Rayleigh-Ritz  procedure  to  obtain  the  approximate  eigenpairs 

{AYk  ^  BYkQk) 

5:  Form  =  Vj^AVk 

6:  Compute  all  eigenpairs  of  H^,  HkXk  =  XkQk 

Assume  the  eigenvalues  are  sorted  in  ascending  order  0i  <  62  <  ■  ■  ■  <  Oc 
7:  Compute  the  Ritz  vectors  I4  =  VkXk 

Let  Yk^s  denote  the  s  Ritz  vectors  corresponding  to  the  smallest  Ritz  values 
8:  Compute  the  residual  vectors  Rk  =  AYk  —  BYkQk 

9:  Test  for  convergence 

10:  if  c  +  s  >  d  then 

11:  Restart  with  I4  =  14, s  and  c  =  s 

12:  end  if 

13:  Solve  the  saddle  point  problem 


A 

to 

_ 1 

Ak 

A 

_ 1 

1 - 

to 

0 

Lk 

0 

approximately  to  obtain 
14:  Add  Ak  to  the  subspace,  I4+1  =  [I4  A^.] 

15:  end  for 
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•  Add  Afc_i  to  the  subspace: 

\4  =  A,_i] 

Additionally,  we  only  need  to  compute  the  s  new  vectors  of  AVk  and  the  corresponding 
columns  of  Hk  =  AVk- 

5.2  Selecting  the  block  size 

Unlike  TraceMin,  we  can  run  TraceMin-Davidson  with  a  block  size  smaller  than 
the  number  of  desired  eigenpairs.  Using  a  larger  block  size  involves  more  work  per 
TraceMin-Davidson  iteration  but  has  the  potential  to  reduce  the  number  of  required 
TraceMin-Davidson  iterations.  Additionally,  using  a  block  size  s  >  1  allows  us  to 
use  block  operations  in  the  linear  solve  step,  which  can  be  more  effective  on  parallel 
architectures.  One  further  consideration  in  choosing  the  block  size  is  its  effect  on 
global  convergence.  If  the  block  size  is  smaller  than  the  multiplicity  of  the  eigenvalues 
sought,  we  may  miss  the  correct  multiplicity.  I  will  now  demonstrate  this  with  an 
example. 

We  seek  the  four  smallest  eigenpairs  of  the  3D  discretization  of  the  Laplace  opera¬ 
tor  on  a  unit  cube  of  order  n  =  1000.  The  four  smallest  eigenvalues  are  approximately 
(0.243, 0.480, 0.480,  0.480).  Note  that  the  second  eigenvalue  has  a  multiplicity  of  three. 
We  will  run  TraceMin-Davidson  twice,  once  with  a  block  size  of  s  =  1,  and  once  with 
a  block  size  of  s  =  4.  Our  initial  subspace  contains  four  vectors,  and  we  do  not  use 
restarts.  In  both  cases,  we  consider  an  eigenpair  as  having  converged  if  the  absolute 
residual  ||r  =  Ay  —  dBy\\2  <  10“^.  Figure  5.1  shows  that  with  a  block  size  of  s  =  4, 
TraceMin-Davidson  converges  to  the  true  eigenvalues  (plotted  as  black  circles)  after 
10  iterations.  With  a  block  size  of  s  =  1,  TraceMin-Davidson  reports  convergence 
after  22  iterations.  With  the  larger  block  size,  TraceMin-Davidson  would  likely  scale 
better  due  to  its  capacity  to  use  block  operations  during  the  linear  solve  step.  The 
most  important  reason  to  use  a  larger  block  size  here  though  is  this:  when  we  used  a 
block  size  of  s  =  1,  we  did  not  converge  to  the  correct  eigenpairs.  TraceMin-Davidson 
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returned  four  eigenpairs  with  a  residual  ||r||2  <  10“^,  but  the  fourth  one  it  returned 
was  0.716,  which  is  the  fifth  eigenvalue  of  this  particular  problem.  If  this  were  not 
a  small  synthetic  problem,  we  would  have  no  idea  that  the  eigenpairs  returned  by 
TraceMin-Davidson  were  incorrect.  For  this  reason,  I  set  the  default  block  size  to  be 
the  same  as  the  desired  number  of  eigenpairs  in  my  TraceMin-Davidson  implemen¬ 
tation.  If  the  user  has  some  knowledge  about  the  spectrum,  he  may  choose  to  use  a 
smaller  block  size. 

5.3  Computing  harmonic  Ritz  values 

Each  TraceMin-Davidson  iteration  essentially  consists  of  two  parts:  compute  a 
R-orthonormal  basis  R  of  a  subspace  IK,  then  compute  an  approximate  eigenvector^ 
y  based  on  that  subspace  such  that 


yeK  (5.2) 

and  the  Galerkin  condition 

r  ±K  (5.3) 

is  satished,  where  r  =  Ay  —  9 By.  From  conditions  5.2  and  5.3,  we  know  that  y  =  Vu 
where  u  is  the  eigenvector  corresponding  to  the  desired  eigenvalue  of  the  following 
small  dense  problem 

V^AVu  =  Ou  (5.4) 

This  is  commonly  known  as  the  Rayleigh- Ritz  procednre.  These  Ritz  pairs  (0,  y) 
tend  to  approximate  extreme  eigenpairs  better  than  interior  ones  [19].  If  we  wish  to 
compnte  interior  eigenpairs,  we  may  instead  compute  the  harmonic  Ritz  pairs. 

Instead  of  using  an  orthogonal  projection  method  as  we  did  before,  we  can  use  an 
oblique  projection  method.  Let  IK  be  the  space  spanned  by  the  vectors  of  V  as  before 
^Here,  we  focus  on  the  single  vector  case,  but  this  also  applies  to  blocks  of  vectors. 


Ritz  values  Ritz  values 
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TraceMin-Davidson  with  block  size  4 


TraceMin-Davidson  with  block  size  1 


Figure  5.1.:  The  effect  of  block  size  on  TraceMin-Davidson’s  convergence 


Error  in  Ritz  values  Error  in  Ritz  values 


51 


TraceMin-Davidson  with  block  size  1 


TraceMin-Davidson  with  block  size  4 


Figure  5.2.:  The  effect  of  block  size  on  TraceMin-Davidson’s  convergence  (continued) 
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and  L  =  L  ^AL  where  B  =  LL^  is  the  Choleski  factorization  of  B.  We  seek  an 
approximate  eigenvector  y  such  that 

y  G  IK  (5.5) 

and  the  Petrov-Galerkin  condition 

r  ±  L  (5.6) 

As  before,  we  have  y  =  Vu  where  u  is  the  eigenvector  corresponding  to  the  desired 
eigenvalue  of  a  slightly  different  eigenvalue  problem 

V^AB-^AVu  =  eV^AVu  (5.7) 

where  V  =  L~'^V\  if  this  is  a  standard  eigenvalue  problem  {B  =  /),  we  have 

V^A^Vu  =  eV^AVu  (5.8) 

instead.  Assuming  W  =  L~^AV  is  an  orthonormal  basis  of  L,  our  problem  can  be 
rewritten  as 

W'^A-^Wu  =  \u  (5.9) 

U 

(where  W  =  LW)  making  this  method  mathematically  equivalent  to  using  an  or¬ 
thogonal  projection  process  for  computing  the  eigenpairs  of  A~^ .  The  harmonic  Ritz 
vectors  maximize  Rayleigh  quotients  for  A~^,  so  they  can  be  interpreted  as  the  best 
information  one  has  for  the  smallest  magnitude  eigenvalues  [19].  I  will  now  present 
a  small  problem  demonstrating  how  the  use  of  harmonic  Ritz  values  can  beneht 
TraceMin-Davidson. 

We  wish  to  compute  the  four  smallest  eigenpairs  of  an  Anderson  matrix  of  order 
n  =  64  with  an  absolute  residual  ||r||2  <  10“^;  the  eigenvalues  of  interest  are  (0.1391, 
-0.3688,  0.5609,  -0.9419).  We  will  use  a  block  size  of  s  =  1,  an  initial  subspace  of  hve 
vectors,  and  no  restarts.  Figure  5.3  shows  that  over  time,  we  start  converging  to  the 
correct  eigenvalues  (plotted  as  black  circles)  whether  or  not  we  compute  the  harmonic 
Ritz  values.  However,  we  see  far  more  oscillation  in  the  standard  Ritz  values  than  the 
harmonic  ones.  For  instance,  we  appear  to  have  “lost”  the  Ritz  value  approximating 
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-0.9419  at  iteration  21.  That  Ritz  vector  is  still  in  the  subspace;  the  corresponding 
Ritz  value  just  isn’t  the  smallest.  That  may  seem  like  a  minor  detail  at  first,  but 
remember  that  when  using  an  eigensolver  with  expanding  subspaces,  the  order  of  the 
Ritz  pairs  matters.  We  are  going  to  use  the  first  Ritz  vector  to  compute  the  next 
addition  to  the  subspace.  If  the  vectors  were  sorted  poorly,  we  will  not  get  a  good 
addition  to  the  subspace.  When  the  Ritz  values  are  sorted  in  such  a  way,  we  may  face 
trouble  computing  Ritz  shifts.  We  may  even  accidentally  discard  the  Ritz  vectors 
of  interest  upon  restarting,  which  would  be  disastrous.  By  computing  the  harmonic 
Ritz  values,  we  have  changed  how  the  Ritz  pairs  are  sorted  and  see  far  less  of  this 
concerning  oscillatory  behavior. 


Ritz  values  Ritz  values 
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TraceMin-Davidson  with  standard  Ritz  vaiues 


Number  of  TraceMin-Davidson  iterations 


TraceMin-Davidson  with  harmonic  Ritz  vaiues 


Number  of  TraceMin-Davidson  iterations 


Figure  5.3.:  A  comparison  of  TraceMin-Davidson  with  standard  and  harmonic  Ritz 
values 


Error  in  Ritz  values  Error  in  Ritz  values 
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TraceMin-Davidson  with  standard  Ritz  vaiues 


TraceMin-Davidson  with  harmonic  Ritz  vaiues 


Figure  5.4.:  A  comparison  of  TraceMin-Davidson  with  standard  and  harmonic  Ritz 
values  (continued) 


Error  in  trace 
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Figure  5.5.:  A  comparison  of  TraceMin-Davidson  with  standard  and  harmonic  Ritz 
values  (continued) 
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6  IMPLEMENTATIONS 

So  far,  we  have  discussed  how  to  hnd  a  few  of  the  smallest  eigenpairs  of  the  generalized 
eigenvalue  problem  Ax  =  XBx,  but  what  if  we  have  a  different  goal?  For  instance, 
we  may  want  to  compute  a  few  of  the  largest  eigenpairs,  or  we  may  wish  to  compute 
all  of  the  eigenpairs  within  some  interval.  This  chapter  will  address  how  we  modify 
TraceMin  to  solve  various  types  of  problems,  divided  into  the  following  categories. 

•  Finding  a  very  small  number  of  interior  or  extreme  eigenpairs  (using  TraceMIN- 
Standard 

•  Finding  out  whether  there  are  any  eigenvalues  in  a  given  interval,  and  Ending 
a  small  subset  of  eigenpairs  if  they  exist  (using  TraceMIN-Sampling) 

•  Finding  all  eigenpairs  in  an  interval,  preferably  one  containing  many  eigenpairs 
(nsing  TraceMIN-Multisectioning) 

6.1  Computing  a  few  eigenpairs:  TraceMin-Standard 

In  this  section,  we  will  discuss  how  to  use  spectral  transformations  to  compnte 
a  difierent  target  than  the  eigenpairs  of  smallest  magnitnde.  We  will  also  discuss  a 
special  case  known  as  the  compntation  of  the  Fiedler  vector. 

6.1.1  Compnting  the  eigenvalues  of  largest  magnitude 

If  we  wish  to  find  the  eigenvalnes  of  largest  magnitude  (with  their  associated 
eigenvectors)  of  the  problem 

Ax  =  XBx  (6.1) 

Although  I  refer  to  this  implementation  as  TraceMin-Standard,  nothing  would  change  if  we  used 
TraceMin-Davidson  instead.  This  also  applies  to  TraceMin- Sampling  and  TraceMin-Multisectioning. 
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where  A  and  B  are  symmetric  positive  definite,  this  is  equivalent  to  computing  the 
smallest  eigenvalues  of 

Bx  =  a  Ax  (6.2) 

where  a  =  j.  We  can  run  TraceMin  on  the  problem  of  Equation  6.2  to  obtain  the 
solution  of  our  target  problem. 

Note  that  if  we  wish  to  compute  the  largest  eigenvalues  of  a  standard  eigenvalue 
problem  {B  =  /),  TraceMin’s  saddle  point  problem  is  greatly  simplified.  The  solution 
to  the  saddle  point  problem  is 

V  =  AY  {Y^ A^Y)~^  (6.3) 

All  that  is  required  to  solve  this  problem  is  an  inner  product  and  the  solution  of  a 
small  dense  linear  system  with  many  right  hand  sides^.  We  do  not  need  to  solve  a 
single  linear  system  of  the  form  Ax  =  b. 

If  we  wish  to  compute  the  largest  eigenvalues  of  a  standard  eigenvalue  problem, 
Ritz  shifts  should  be  disabled,  since  they  would  require  solving  linear  systems  of  the 
form  (J  —  uA)  x  =  b,  where  u  is  our  desired  shift.  Even  if  the  eigenvalue  distribution 
is  poor,  performing  many  iterations  without  the  shift  will  probably  be  cheaper  than 
solving  the  linear  systems  required  by  the  use  of  a  shift. 

6.1.2  Computing  the  eigenvalues  closest  to  a  given  value 

If  we  would  like  to  compute  the  eigenpairs  nearest  a  given  value  a,  we  need  only 
solve  the  eigenproblem 

{A  —  aB)  X  =  {\  —  a)  Bx  (6.4) 

If  (A  —  a,  x)  is  an  eigenpair  of  (6.4),  then  (A,  x)  is  an  eigenpair  of  the  original  problem 
Ax  =  \Bx. 


^Technically,  we  do  not  need  to  solve  those  small  dense  linear  systems  because  AY  and  AY  (Y^A'^Y) 
lie  in  the  same  subspace. 
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6.1.3  Computing  the  absolute  smallest  eigenvalues 

If  we  wish  to  compute  the  absolute  smallest  eigenvalues  of  a  matrix,  we  may 
simply  shift  to  the  left  edge  of  the  spectrum  and  run  TraceMin  as  we  normally  would. 
However,  this  requires  us  to  be  able  to  bound  the  eigenvalues.  We  can  use  the 
Gerschgorin  circle  theorem  to  determine  some  (3  such  that  A  —  /3B  is  symmetric 
positive  semi-dehnite  and  compute  the  smallest  magnitude  eigenpairs  of  6.4.  The 
produced  by  the  Gerschgorin  circle  theorem  tends  to  be  far  from  the  smallest 
eigenvalue,  which  means  TraceMin  will  have  a  lackluster  convergence  rate,  but  we 
have  already  studied  how  dynamic  Ritz  shifts  can  help  solve  this  problem. 

6.1.4  Computing  the  Fiedler  vector 

In  this  case,  we  are  interested  in  the  eigenvector  corresponding  to  the  smallest 
nonzero  eigenvalue  of  a  standard  eigenvalue  problem  Ax  =  Xx,  where  A  is  symmetric 
positive  semi-dehnite.  This  eigenvector,  known  as  the  Fiedler  vector,  tells  us  how  to 
reorder  a  matrix  to  either  reduce  the  bandwidth  or  bring  large  elements  toward  the 
diagonal.  If  the  graph  consists  of  only  one  strongly  connected  component^,  the  graph 
Laplacian  H’s  null  space  is  exactly  one  vector,  the  vector  of  all  Is^. 

The  fact  that  A  is  singular  can  cause  problems  for  some  eigensolvers.  However, 
we  have  already  demonstrated  that  TraceMin  does  not  rely  on  accurate  linear  solves. 
Furthermore,  if  we  project  this  null  vector  out  of  the  set  of  basis  vectors  14  at  each 
TraceMin  iteration,  all  of  the  linear  systems  we  solve  will  be  consistent. 

the  graph  consists  of  multiple  strongly  connected  components,  it  can  be  split  up  into  many 
smaller  eigenproblems,  one  per  strongly  connected  component. 

^We  assume  for  simplicity  that  A  was  not  normalized. 
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6.1.5  Computing  interior  eigenpairs  via  spectrum  folding 


If  we  wish  to  compute  the  interior  eigenpairs  of  a  standard  eigenvalue  problem 
Ax  =  Ax,  we  may  instead  solve  the  equivalent  eigenvalue  problem 


=  A^x 


(6.5) 


This  is  known  as  spectrum  folding^.  Instead  of  seeking  the  smallest  magnitude  eigen¬ 
pairs  of  the  symmetric  indehnite  matrix  A,  i.e.  the  eigenvalues  closest  to  zero  (which 
can  be  positive  or  negative),  we  seek  the  smallest  magnitude  eigenpairs  of  the  sym¬ 
metric  positive  dehnite  operator  Note  that  it  is  a  bad  idea  to  form  the  matrix  A'^ 
explicitly,  so  we  will  apply  that  operator  implicitly.  Working  with  this  operator  has 
several  side  effects.  The  most  obvious  side  effect  is  that  we  have  squared  the  condition 
number  of  the  matrix,  so  it  is  now  more  difficult  to  solve  the  linear  systems  arising  at 
each  TraceMin  iteration.  Again,  TraceMin  does  not  rely  on  accurate  linear  solves,  so 
this  should  not  impede  TraceMin’s  convergence  as  much  as  other  eigensolvers.  The 
other  effect  of  this  transformation  is  that  instead  of  each  eigenpair  converging  at  the 
rate 


A. 

As+i 


(6.6) 


they  will  now  converge  at  the  faster  rate® 


A. 


(6.7) 


Also  note  that  instead  of  solving  a  linear  system  of  the  form  A^x  =  6  at  each  TraceMin 

iteration,  we  can  instead  solve  two  linear  systems  each  of  the  form  Ax  =  b. 

To  demonstrate  the  effect  of  spectrum  folding,  let  us  again  solve  the  Anderson 

problem  of  section  5.3.  We  will  compute  the  four  smallest  eigenpairs  of  that  same 

Anderson  matrix  of  order  n  =  64  with  an  absolute  residual  ||r||2  <  10“®  using  a  block 

^TraceMin  and  TraceMin-Davidson  are  both  capable  of  computing  interior  eigenpairs  without  using 
spectrum  folding,  but  the  trace-minimization  property  no  longer  holds.  If  we  use  spectrum  folding, 
TraceMin’s  global  convergence  proof  still  holds. 

®This  convergence  rate  helps  us  to  estimate  how  many  iterations  TraceMin  will  require  if  the  linear 
systems  are  solved  directly.  When  we  use  an  iterative  solver,  we  will  likely  need  more  TraceMin 
iterations  to  converge. 
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size  of  s  =  1,  an  initial  subspace  of  five  vectors,  and  no  restarts.  As  a  reminder, 
the  eigenvalues  of  interest  are  (0.1391,  -0.3688,  0.5609,  -.9419).  This  time,  we  will 
try  spectrum  folding  to  compute  the  eigenpairs  of  smallest  magnitude.  Figure  6.1 
shows  that  whether  we  used  the  standard  Ritz  extraction,  harmonic  Ritz  extraction, 
or  spectrum  folding,  we  converged  to  the  correct  eigenvalues.  Using  a  harmonic  Ritz 
extraction,  TraceMin-Davidson  took  23  iterations  to  converge;  with  spectrum  folding, 
we  only  required  17  TraceMin-Davidson  iterations  (since  the  spectrum  of  eigenvalues 
was  improved).  Because  the  matrix  was  so  small,  I  formed  explicitly  for  spectrum 
folding  and  used  a  direct  solver  to  solve  the  saddle  point  problem  arising  at  each 
TraceMin-Davidson  iteration.  As  a  result,  spectrum  folding  would  have  been  the 
fastest  method  in  this  case.  For  larger  problems  where  that  is  infeasible,  spectrum 
folding  will  require  roughly  twice  as  much  work  per  TraceMin-Davidson  iteration  (as 
compared  to  not  using  spectrum  folding  on  that  same  problem).  If  spectrum  folding 
reduces  the  number  of  required  TraceMin-Davidson  iterations  by  a  factor  of  two  or 
greater,  it  will  be  effective;  otherwise,  it  may  result  in  a  longer  running  time. 

Although  it  may  appear  in  hgures  6.1,  6.2,  and  6.3  that  the  trace  is  not  decreasing 
monotonically  in  the  case  of  spectrum  folding,  that  is  only  because  the  trace  presented 
is  the  trace  of  In  spectrum  folding,  the  trace  of  X^A^X  is  being  decreased 

monotonically  (as  hgures  6.4  and  6.5)  demonstrate.  When  the  Rayleigh  quotients 
xj A^Xi  become  very  close  to  (the  square  of  the  true  eigenvalues),  the  Rayleigh 
quotients  xjAxi  approach  A*  as  well. 

Figures  6.6  and  6.7  present  a  comparison  of  TraceMin  and  TraceMin-Davidson  on 
that  same  Anderson  problem.  Here,  TraceMin  was  run  with  a  subspace  dimension 
of  s  =  8,  meaning  it  will  solve  a  linear  system  with  eight  right  hand  sides  at  each 
iteration.  Figure  6.6  shows  that  with  spectrum  folding,  TraceMin  converged  in  12 
iterations,  whereas  TraceMin-Davidson  converged  in  17.  The  results  are  similar  if  we 
neglect  to  use  spectrum  folding  (hgure  6.7);  TraceMin  required  fewer  iterations  and 
decreased  the  trace  at  a  higher  rate. 


Ritz  values  Ritz  values  Ritz  values 
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TraceMin-Davidson  with  standard  Ritz  vaiues 


Number  of  TraceMin-Davidson  iterations 


TraceMin-Davidson  with  harmonic  Ritz  vaiues 


Figure  6.1.:  A  demonstration  of  the  effect  of  spectrum  folding 


Error  in  Ritz  values  Error  in  Ritz  values  Error  in  Ritz  values 
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TraceMin-Davidson  with  spectrum  folding 


TraceMin-Davidson  with  standard  Ritz  values 


TraceMin-Davidson  with  harmonic  Ritz  values 


Figure  6.2.:  A  demonstration  of  the  effect  of  spectrum  folding  (continued) 


Error  in  trace 
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Figure  6.3.:  A  demonstration  of  the  effect  of  spectrum  folding  (continued) 


Rayleigh  quotient  of  A 
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Figure  6.4.:  A  demonstration  of  the  effect  of  spectrum  folding  (continued) 


Error  in  Rayleigh  quotient  of  A  Error  in  Rayleigh  quotient  of  A 
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Figure  6.5.:  A  demonstration  of  the  effect  of  spectrum  folding  (continued) 


Error  in  trace 
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Number  of  eigensolver  iterations 


Number  of  eigensolver  iterations 


Figure  6.6.:  A  comparison  of  TraceMin  and  TraceMin-Davidson  on  an  indefinite 
problem  (with  spectrum  folding) 


Error  in  trace 
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Figure  6.7.:  A  comparison  of  TraceMin  and  TraceMin-Davidson  on  an  indefinite 
problem  (without  spectrum  folding) 
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6.1.6  My  parallel  TraceMin-Standard  implementation 

I  chose  to  write  my  TraceMin  and  TraceMin-Davidson  implementations  using  the 
Trilinos  framework;  my  code  is  publicly  available  and  can  be  downloaded  from  the 
Trilinos  website  [20] .  Trilinos  contains  a  variety  of  linear  system  and  eigenvalue  prob¬ 
lem  solvers  (among  other  things)  similar  to  PETSc/SLEPc  [21-26],  but  unlike  PETSc, 
Trilinos  supports  block  linear  solves.  Trilinos  has  impressive  parallel  scalability,  sup¬ 
ports  very  large  problems,  and  is  effective  on  many  different  architectures  including 
GPUs.  To  make  the  spectral  transformations  of  this  section  easier  for  the  user,  I 
provided  several  example  drivers  demonstrating  their  use,  all  of  which  are  available 
in  the  Trilinos  Doxygen  documentation  [27]. 

The  sparse  matrices  A  and  B  are  stored  in  compressed  sparse  row  format,  using 
a  block  row  distribution  Tall  dense  matrices  such  as  V  and  V  (referred  to  in 
Trilinos  as  multivectors)  are  stored  using  a  block  row  distribution  as  well.  Small 
dense  matrices  such  as  TT  =  V'^ AV  are  replicated  rather  than  distributed;  each  MPI 
process  owns  a  copy.  Using  this  data  distribution,  the  following  distributed  kernels 
are  required 

•  sparse  matrix  times  multivector  multiplication  (referred  to  as  a  matvec) 

•  inner  product  of  two  multivectors 

•  5-orthonormahzation  of  a  multivector 

•  solution  of  a  sparse  symmetric  linear  system  with  multiple  right  hand  sides 

Within  a  node,  we  can  choose  to  use  OpenMP  for  shared  memory  parallelism,  CUBA 

for  GPUs,  or  we  can  simply  spawn  more  MPI  processes®. 

My  Trilinos  implementation  of  TraceMin  uses  the  matvec,  inner  product,  and 

i?-orthnormalization  routines  dehned  in  Trilinos,  although  it  also  allows  users  to 

^This  is  how  I  chose  to  store  the  matrices  for  the  test  cases  I  will  present  later,  but  it  is  not  a 
requirement. 

^Trilinos  allows  its  users  to  specify  a  node  type  and  switches  its  strategy  for  handling  shared  memory 
computations  based  on  that  node  type. 
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provide  their  own  implementations  of  any  of  these  operations  The  Tpetra  matvec 
has  been  optimized  to  take  advantage  of  the  sparsity  pattern  of  the  matrix,  and  it 
performs  the  minimum  amount  of  communication  required.  Trilinos  also  contains 
three  different  orthonormalization  routines  which  can  be  used  in  parallel;  modihed 
Gram-Schmidt  [28],  tall  skinny  QR  [29,30],  or  by  using  algorithm  6  which  uses  the 
eigendecomposition  of  a  small  dense  matrix  to  R-orthonormalize  a  set  of  vectors. 

The  Belos  package  of  Trilinos  contains  many  Krylov  solvers  capable  of  solving  lin¬ 
ear  systems  with  multiple  right  hand  sides.  Some  of  these  are  block  methods  which 
build  one  shared  Krylov  subspace  for  all  right  hand  sides,  and  others  are  pseudo¬ 
block,  meaning  they  are  mathematically  equivalent  to  solving  each  linear  system  in¬ 
dependently,  but  the  communication  and  memory  accesses  are  more  effiecient.  Block 
methods  can  solve  sets  of  linear  systems  of  the  form  Axi  =  bi,  where  all  right  hand 
sides  bi  are  available  simultaneously.  In  the  case  of  TraceMin,  we  end  up  having 
to  solve  linear  systems  of  the  form  [A  —  UiB)  Xi  =  bi  if  we  choose  to  use  multiple 
dynamic  Ritz  shifts.  Belos’  pseudoblock  Krylov  solvers  are  currently  incapable  of 
solving  linear  systems  with  indexed  operators  such  as  we  have  here,  so  I  wrote  my 
own  pseudo-block  MINRES  which  accepts  indexed  operators.  This  MINRES  uses  the 
efficient  parallel  kernels  I  mentioned  previously. 

Algorithm  6  Orthnormalization  via  eigendecomposition 
Require:  B  G  symmetric  positive  dehnite 

Void  G  with  rank  s  is  the  set  of  vectors  to  be  R-orthonormalized 

1:  Form  H2  =  V;^i^BVoid 

2:  Compute  the  eigendecomposition  of  H2,  H2X2  =  X2Q2 

-1 

3:  Form  Vnew  =  VoidX2V)2  ,  which  is  R-orthonormal 


®My  TraceMin  implementation  only  needs  to  know  how  a  matrix  multiplication  works,  as  well  as 
how  certain  vector  operations  are  performed;  it  does  not  need  the  matrices  or  vectors  to  be  stored  in 
any  specific  way.  This  allows  users  to  take  advantage  of  any  special  structure  their  matrices  might 
have,  or  write  code  that  performs  well  on  unique  architectures.  It  also  means  that  TraceMin  can  be 
used  to  solve  problems  where  the  matrix  is  never  made  explicitly  available. 
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6.2  TraceMIN-Sampling 

We  are  now  interested  in  finding  out  whether  any  eigenvalues  exist  within  a  certain 
interval.  If  so,  we  must  obtain  a  few  eigenpairs  near  a  set  of  shifts  within  that  interval. 
These  shifts  can  be  handled  completely  independently. 

We  spawn  a  number  of  MPI  processes  equal  to  the  number  of  desired  shifts.  As¬ 
suming  sufficient  space  is  available,  matrices  A  and  B  are  replicated  on  each  node. 
Then  we  run  a  separate  instance  of  TraceMIN  on  each  node,  using  a  spectral  trans¬ 
formation  to  hnd  the  eigenpairs  nearest  a  given  shift.  This  process  requires  no  com¬ 
munication  across  nodes.  The  only  thing  hindering  parallel  scalability  is  the  potential 
for  load  imbalance. 

Note  that  while  this  choice  of  MPI  processes  is  optimal  from  a  scalability  stand¬ 
point,  it  is  by  no  means  required.  If  it  is  infeasible  to  replicate  A  and  B  on  each 
node,  we  may  also  divide  our  MPI  processes  into  small  groups,  perhaps  4  processes 
per  group.  Then,  each  group  of  processes  would  store  the  matrices  in  a  distributed 
fashion.  Instead  of  running  one  instance  of  TraceMin  per  MPI  process,  we  could  then 
run  one  instance  of  TraceMin  per  group.  There  would  then  be  a  small  amount  of 
MPI  communication,  but  it  would  be  limited  to  the  processes  within  the  individual 
groups.  There  would  be  no  global  communication  required. 

If  the  number  of  desired  shifts  is  greater  than  the  number  of  groups  of  MPI 
processes,  each  group  would  be  assigned  a  small  subset  of  shifts  and  run  TraceMin 
once  for  each  shift.  A  potential  load  balancing  strategy  for  such  a  case  will  be  explored 
in  the  next  section. 

6.3  TraceMIN-Multisectioning 

In  this  case,  we  want  to  hnd  all  the  eigenpairs  within  a  given  interval  (which  I 
will  refer  to  as  the  global  interval).  Assuming  this  interval  contains  many  eigenpairs, 
it  would  be  impractical  to  run  a  single  instance  of  TraceMIN  to  compute  all  the 
eigenpairs  together;  we  might  not  even  have  enough  space  to  store  all  of  the  required 
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eigenvectors.  We  will  need  to  break  the  interval  up  into  smaller  pieces,  each  of  which 
can  be  solved  independently. 

I  propose  a  method  similar  in  nature  to  adaptive  quadrature.  In  adaptive  quadra¬ 
ture,  you  want  to  calculate  the  integral  of  some  function  on  a  given  interval.  If  the 
interval  is  “bad,”  namely  the  error  estimate  is  too  large,  then  you  break  it  in  two 
and  repeat  the  process  with  each  half.  You  continue  recursing  in  this  way  until  you 
have  a  set  of  satisfactory  intervals;  whether  an  interval  is  satisfactory  is  defined  by  a 
tolerance  parameter. 

In  the  case  of  TraceMIN-MuItisectioning,  we  start  with  some  global  interval  of 
interest  just  like  in  adaptive  quadrature.  Then,  we  evaluate  whether  the  interval  is 
“bad,”  meaning  it  contains  too  many  eigenvalues.  If  so,  it  is  divided  in  half  and 
the  procedure  is  repeated.  This  process  continues  recursively  until  we  have  a  set  of 
satisfactory  intervals;  each  interval  must  contain  at  most  Ue  eigenvalues,  where  rie  is 
a  parameter  dehned  by  the  user.  Figure  6.8  illustrates  the  multisectioning  procedure. 

Let  us  assume  an  interval  containing  20  eigenvalues  is  sufficiently  small  (i.e.  = 

20).  In  the  hrst  image,  we  start  with  the  interval  containing  all  eigenvalues  in  the 
range  [0, 1000].  We  know  there  are  50  eigenvalues  in  that  interval.  Since  that  is  too 
many,  we  divide  the  interval  in  half  and  obtain  two  smaller  intervals:  [0,500],  which 
contains  30  eigenvalues,  and  [500, 1000],  which  contains  20.  The  second  subinterval  is 
sufficiently  small  and  does  not  need  to  be  subdivided  further.  The  first  one,  however, 
gets  divided  into  the  intervals  [0,  250]  and  [250,  500],  each  of  which  contain  fewer  than 
20  eigenvalues.  In  the  end,  rather  than  running  TraceMIN  on  the  interval  [0, 1000], 
we  run  3  independent  instances  of  TraceMIN  on  the  intervals  [0,250],  [250,500],  and 
[500,1000]. 

6.3.1  Obtaining  the  number  of  eigenvalues  in  an  interval 

To  obtain  the  number  of  eigenvalues  in  a  particular  interval,  we  use  a  sparse  fac¬ 
torization  method  such  as  PARDISO  [31,32],  MUMPS  [33,34],  or  WSMP  to  compute 


(c)  Final  result 


Figure  6.8.:  An  example  of  interval  subdivision  for  multisectioning 
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the  inertia  of  a  shifted  matrix  [35] .  If  A  —  aB  has  pi  positive  eigenvalues,  and  A  —  bB 
has  p2  positive  eigenvalues,  then  Ax  =  XBx  has  pi  —  p2  eigenvalues  in  the  interval 
(a,  6). 

We  now  turn  our  attention  to  how  these  small  intervals  should  be  assigned  to 
various  MPI  processes. 

6.3.2  Assigning  the  work 

This  section  describes  our  two  implementations  of  TraceMIN-Multisectioning  and 
how  they  divide  the  work. 

Static  work  allocation 

One  way  to  implement  this  is  to  use  a  static  work  allocation.  Each  MPI  process  is 
assigned  a  segment  of  the  large  global  interval  to  subdivide  (as  in  Figure  6.9a).  Figure 
6.9b  shows  the  result  of  our  subdivision:  each  MPI  process  now  owns  a  set  of  small 
intervals.  Some  intervals  may  turn  out  to  be  empty  and  get  discarded;  the  black  line 
under  process  0  denotes  an  empty  interval  that  got  discarded.  Each  of  these  MPI 
processes  now  has  a  different  amount  of  work,  so  we  will  perform  one  communication 
where  the  work  gets  redistributed  so  that  each  MPI  process  is  given  a  roughly  equal 
number  of  subintervals  on  which  to  run  TraceMin.  The  redistributed  work  is  shown 
in  Figure  6.9c. 

This  implementation  has  the  advantage  of  requiring  absolutely  no  communication 
after  the  work  has  been  divided  amongst  the  processes.  However,  there  exists  the 
potential  for  a  high  load  imbalance  for  two  reasons.  First  of  all,  the  number  of 
subintervals  may  not  be  evenly  divisible  by  the  number  of  processes.  For  instance, 
if  we  obtain  Eve  subintervals  from  the  recursive  division  of  the  large  global  interval 
with  four  MPI  processes,  one  process  will  be  assigned  twice  as  many  intervals  as 
the  others.  More  importantly  though,  the  number  of  assigned  intervals  is  not  a 
good  estimate  of  the  amount  of  work,  since  different  intervals  may  require  vastly 
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(a)  Each  MPI  process  has  one  part  of  the  interval  to  subdivide 


0 


1 


2 


(b)  Each  MPI  process  has  subdivided  their  initial  interval  and  now  owns  several  smaller 
intervals 


0 


1  2 


(c)  The  final  work  distribution 


Figure  6.9.:  An  example  of  static  work  allocation  for  TraceMin-Multisectioning  with 
3  MPI  processes 

different  amounts  of  work.  One  would  expect  intervals  containing  more  eigenvalues 
to  be  more  computationally  intensive,  but  the  factor  that  most  greatly  influences  the 
running  time  is  the  distribution  of  eigenvalues  in  each  interval,  which  is  unknown  until 
we’ve  run  several  iterations  of  TraceMIN.  Even  if  every  MPI  process  were  assigned 
the  same  number  of  subintervals,  and  each  subinterval  contained  the  same  number  of 
eigenvalues,  there  would  still  be  potential  for  a  massive  load  imbalance  simply  because 
some  intervals  have  a  considerably  more  favorable  eigenvalue  distribution  than  others. 

Dynamic  work  allocation 

To  remedy  the  load  imbalance  issues  of  the  previous  implementation,  we  can 
dynamically  assign  the  work  as  needed.  This  process  is  best  described  via  an  analogy. 

At  McDonalds  (or  any  other  large  company),  there  is  a  hierarchical  structure  to 
the  employees.  There  is  one  CEO  who  is  responsible  for  assigning  work  to  the  other 
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employees.  That  is  his  entire  responsibility;  he  doesn’t  go  down  to  the  kitchen  and 
flip  burgers.  McDonalds  also  employs  thousands  of  workers  who  are  only  responsible 
for  doing  the  work,  i.e.  flipping  burgers.  These  workers  never  communicate  with  the 
CEO  directly  because  that  would  be  overwhelming  for  the  CEO.  Instead,  they  are 
divided  up  into  groups  based  on  their  location,  and  each  group  has  a  store  manager. 
The  store  manager  is  responsible  for  relaying  important  messages  between  the  workers 
and  the  CEO.  Unlike  the  CEO,  the  store  manager  is  also  required  to  flip  burgers. 

In  this  TraceMIN  implementation,  we  divide  the  MPI  processes  into  three  general 
categories  similar  to  the  categories  of  McDonalds  employees: 

•  master:  similar  to  the  CEO,  responsible  for  assigning  work 

•  worker:  similar  to  the  burger-flippers,  responsible  for  doing  work  on  an  indi¬ 
vidual  interval 

•  group  leader:  similar  to  the  store  manager,  responsible  for  work  and  communi¬ 
cation 

The  MPI  processes  are  broken  up  into  groups  consisting  of  a  leader  and  many 
workers.  Each  group  handles  one  interval  at  a  time  (which  has  been  assigned  by 
the  master).  There  are  two  types  of  communication  for  two  levels  of  parallelism. 
The  master  sends  messages  to  the  individual  group  leaders,  informing  them  of  which 
subinterval  their  group  is  expected  to  process.  The  group  leader  and  workers  col¬ 
laborate  to  run  TraceMIN  on  their  assigned  subinterval.  Figure  6.10  illustrates  this 
process  with  an  example. 

Which  method  to  choose 

With  very  few  MPI  processes,  it  does  not  make  sense  to  use  a  dynamic  work  allo¬ 
cation,  since  it  prevents  one  MPI  process  from  doing  any  useful  work;  it  is  preferable 
to  use  the  static  work  allocation  in  that  case.  If  there  are  enough  MPI  processes  for 
the  load  imbalance  to  become  apparent,  the  dynamic  work  allocation  works  better. 
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(31 


Master 


Incomplete  work... 
[1000,12501-20  evals 
[5000,5200]- 13  evals 


Our  work... 

I  [0,1000] -50  evals  I 


leader  1 


Other  groups. 


Worker  1  -2  |  Worker  1  -3 1  Worker  1  -4 


(a)  The  master  maintains  a  list  of  work  that  still  needs  to  be 
done.  Group  1  has  a  large  interval  that  must  be  subdivided. 


(b)  Group  1  performs  a  factorization  to  divide  its  interval  in  two. 
It  keeps  one  of  the  subintervals,  and  the  group  leader  sends  the 
other  to  the  master. 


Figure  6.10.:  A  demonstration  of  TraceMIN’s  dynamic  load  balancing 
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Master 


(3 

r 

Incomplete  work... 
[1000, 1250] -20  evals 
[5000,5200] -13  evals 
[500, 1000] -40  evals 

Our  work... 
[0,500]- 10  evals 


Group  leader  1 


Other  groups. 


Worker  1  -2  Worker  1  -3  Worker  1  -4 


(a)  The  master  has  added  the  subinterval  (500,1000)  to  the 
list  of  incomplete  work.  Meanwhile,  group  1  runs  TraceMIN 
on  the  small  interval  (0,500). 


(b)  Group  1  has  found  the  10  eigenvalues  in  the  interval 
(0,500)  and  needs  more  work.  The  group  leader  requests 
more  work  from  the  master. 


Figure  6.10.:  A  demonstration  of  TraceMIN’s  dynamic  load  balancing  (continued) 
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Here,  have  an  interval! 
[1000,1250] -20  evals 


Master 

Incomplete  work... 

[5000,5200] -13  evals 

[500,1000] -40  evals 

(3 

Our  work.. 
NONE 


Group  leader  1 


Other  groups. 


Worker  1-2  Worker  1-3  Worker  1-4 


(a)  The  master  sends  an  interval  from  its  list  of  incomplete  work 
to  group  leader  1. 


(b)  Group  leader  1  relays  this  message  to  the  workers  in  its 
group,  since  the  master  never  directly  communicated  with  them. 


Figure  6.10.:  A  demonstration  of  TraceMIN’s  dynamic  load  balancing  (continued) 
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(a)  Group  1  runs  TraceMIN  on  the  interval  (1000,1250). 


Figure  6.10.:  A  demonstration  of  TraceMIN’s  dynamic  load  balancing  (continued) 
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7  COMPETING  EIGENSOLVERS 

I  compare  TraceMIN  with  several  state  of  the  art  packages  for  computing  eigenpairs 
of  sparse  symmetric  eigenvalue  problems  such  as 

•  SLEPc:  an  eigensolver  package  built  on  top  of  Argonne  National  Laboratory’s 
PETSc,  which  implements  a  variety  of  different  eigensolvers  [22-26] 

•  Anasazi:  the  eigensolver  package  of  Sandia  National  Laboratory’s  Trilinos  project 
[20,36] 

•  FEAST:  Eric  Polizzi’s  contour  integration  eigensolver  package  [37,38] 

The  methods  included  in  these  packages  are  described  in  this  section. 

7.1  Arnoldi,  Lanczos,  and  Krylov-Schur 

These  three  methods  are  very  similar  to  the  power  iteration  for  computing  the 
largest  eigenpair  of  a  matrix,  except  that  the  power  iteration  uses  a  constant  sub¬ 
space  dimension  (like  TraceMin)  and  these  methods  use  expanding  subspaces  (like 
TraceMin-Davidson) .  The  basic  Arnoldi  iteration  generates  a  Krylov  subspace  of  A 
one  vector  at  a  time  as  in  algorithm  7,  then  computes  the  eigenpairs  of  V^AVX  = 
XQ,  where  V  is  the  basis  of  that  Krylov  subspace^.  The  vectors  Y  =  VX  are  approxi¬ 
mate  eigenvectors  of  A,  and  the  diagonal  entries  of  0  are  the  approximate  eigenvalues. 
When  the  subspace  becomes  too  large,  we  restart,  keeping  the  most  important  vectors 
of  the  subspace  and  discarding  the  rest  (just  like  TraceMin-Davidson).  We  can  add 
one  vector  to  the  subspace  at  each  iteration,  or  many  if  we’re  using  block- Arnoldi. 

^In  the  Arnoldi  iteration,  V'^AV  will  be  upper  Hessenberg.  If  A  is  symmetric,  P^AP  will  be 
tridiagonal  and  we  have  the  Lanczos  iteration. 
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Note  that  this  method  cannot  compute  the  eigenpairs  of  a  generalized  eigenvalue 
problem  without  a  spectral  transformation  (i.e.  B~^Ax  =  Xx). 

If  we  seek  the  smallest  eigenpairs  of  a  matrix,  we  generally  work  with  A~^  rather 
than  A,  which  is  referred  to  as  shift-and-invert  mode.  Note  that  we  would  never  form 
the  matrix  A~^  explicitly;  at  each  iteration,  we  must  solve  a  linear  system  Avk  =  Ufc-i- 
We  may  use  either  a  direct  or  preconditioned  iterative  method,  but  the  solution  must 
be  accurate.  In  general,  if  we  want  the  relative  residual  of  our  eigenvalues  to  be  less 
than  10“'^,  these  linear  systems  should  be  solved  with  a  relative  residual  no  more  than 


Algorithm  7  Arnoldi  iteration 
Require:  A  E 

Vo  E 


1: 

2: 

3: 

4: 

5: 

6: 

7: 

8: 


77i  =  — — ^ — 

Iholh 

for  /c  =  2  — )■  maxit  do 
Vk  =  Avk-i 
for  j  =  1  — )■  /c  —  1  do 
Vk=Vk-  Vjvjvk 

end  for 

end  for 


Krylov-Schur  is  very  similar  to  Arnoldi  apart  from  how  restarting  is  handled  [41, 
42].  When  Arnoldi  is  restarted  with  a  set  of  vectors  Vq,  h  expects  Vq  AVo  to  still  be 
upper-Hessenberg.  Krylov-Schur  relaxes  the  dehnition  of  an  Arnoldi  decomposition 
to  avoid  the  difficulties  Arnoldi  has  with  deflation  and  restarting.  Krylov  Schur  is 
implemented  in  the  eigensolver  package  SLEPc,  and  a  block  form  exists  in  Anasazi. 
Both  forms  are  capable  of  using  shift-and-invert  mode,  so  I  chose  to  run  SLEPc’s 
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Krylov-Schur  with  shift-and-invert  and  Anasazi’s  block  Krylov-Schur  without  shift- 
and-invert^. 


7.1.1  Krylov-Schur  with  multisectioning 

SLEPc’s  Krylov-Schur  implementation  is  capable  of  multisectioning,  but  it  must 
process  each  subinterval  sequentially.  All  processes  call  the  sparse  factorization  pack¬ 
age  MUMPS  to  factor  the  matrix  in  parallel  [33,34].  Because  the  MPI  processes 
can  not  work  independently,  there  is  the  potential  for  an  overwhelming  amount  of 
communication,  and  Krylov-Schur  will  scale  as  MUMPS  scales.  Note  that  in  my 
TraceMin-multisectioning  implemention,  MUMPS  is  never  called  by  all  processes  si¬ 
multaneously. 


7.2  Locally  Optimal  Block  Preconditioned  Conjugate  Gradient 


The  Locally  Optimal  Preconditioned  Conjugate  Gradient  method  minimizes  (or 
maximizes)  the  generalized  Rayleigh  quotient  at  each  iteration  using  a  three  term 
recurrence  e.g. 


.  ^  y'^Ay 

Vi+i  =  arg  mm  p{y)  = 

y&spa.n{yi,Zi,yi-i}  y  ny 

where  =  M~^r  is  the  preconditioned  residual  [43].  The  solution  to  this  minimiza¬ 
tion  problem  is  obtained  via  the  Rayleigh-Ritz  procedure,  as  outlined  in  algorithm 
8.  Although  this  algorithm  demonstrates  the  single  vector  case,  one  can  choose  to 
use  blocks  instead,  obtaining  the  Locally  Optimal  Block  Preconditioned  Conjugate 
Gradient  method  (LOBPCG). 

This  method  may  experience  trouble  in  the  R-orthonormalization  step  if  the  it¬ 
erations  stagnate,  because  in  that  case  yi  ~  yi-i-  It  is  also  only  capable  of  finding 
extreme  eigenpairs.  We  will  compare  against  both  the  Trilinos  implementation  and 
SLEPC’s  interface  to  BLOPEX,  which  is  Andrew  Knyazev’s  own  implementation  [44]. 


^In  the  absence  of  any  spectral  transformations,  the  only  difference  between  using  Krylov-Schur  to 
compute  the  smallest  eigenpairs  or  the  largest  is  which  vectors  are  kept  upon  restart. 
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Algorithm  8  Locally  Optimal  Preconditioned  Conjugate  Gradient 
Require:  A,B^  both  symmetric  positive  definite 

Vi  G 

1:  for  A;  =  1  — )■  maxit  do 

3:  9  k  =  ylAyk 

4:  Tk  =  Ayk  -  6kByk 

5:  Zk  =  M“Vfe 

6:  if  /c  >  1  then 

7:  Vk  =  [yk-i  yk  Zk] 

8:  else 

9-  hjs  [l/fc  Zk\ 

10:  end  if 

11:  R-orthonormalize  14 

12:  H  =  VifAVk 

13:  Solve  the  small  dense  eigenvalue  problem  HX  =  XQ 

14:  Vk+i  =  VkXi,  where  xi  is  the  eigenvector  corresponding  to  the  smallest  eigen¬ 

value  of  H 
15:  end  for 
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7.3  Jacob  i-Davidson 

Jacobi-Davidson  is  an  eigensolver  which  deals  with  the  same  constrained  minimiza¬ 
tion  problem  as  TraceMin,  using  the  same  projected- Krylov  method.  It  was  published 
in  1996,  14  years  after  the  TraceMin  concept  was  hrst  published  by  Ahmed  Sameh 
and  John  Wisniewski  [9,45].  In  their  1996  publication,  Sleijpen  and  van  der  Vorst, 
applied  it  to  the  nonsymmetric  case  without  a  proof  of  global  convergence.  Later, 
they  popularized  their  scheme  for  the  symmetric  case  for  which  TraceMIN  proved 
convergence  much  earlier.  Unlike  TraceMin,  Jacobi-Davidson  extracts  its  Ritz  vec¬ 
tors  from  a  subspace  that  expands  at  each  iteration;  this  expanding  subspace  concept 
was  later  incorporated  into  the  trace-minimization  algorithm  and  given  the  name 
TraceMin-Davidson  in  2000  [17].  TraceMin  and  TraceMin-Davidson  use  a  very  con¬ 
servative  method  to  compute  their  shifts,  whereas  Jacobi-Davidson  chooses  the  shifts 
to  be  equal  to  the  Ritz  values.  These  Ritz  values  are  frequently  gross  overestimates 
for  the  true  eigenvalues  of  a  matrix  and  can  result  in  very  slow  convergence,  or  con¬ 
vergence  to  the  wrong  set  of  eigenpairs  entirely.  I  will  compare  my  eigensolver  with 
the  SLEPc  implementation  of  Jacobi-Davidson.  In  order  to  avoid  convergence  issues 
caused  by  the  original  Jacobi-Davidson  shifting  strategy,  the  SLEPc  developers  chose 
to  avoid  shifting  until  the  residual  becomes  very  small. 

7.4  Riemannian  Trust  Region  method 

The  Riemannian  Trust  Region  (RTR)  method  is  very  similar  to  TraceMin  in  that 
it  also  seeks  to  minimize  the  function 

fy  (A)  =  trace  (^((K  -  AfB{Y  -  A))''  ((K  -  Af  A  {Y  -  A))^  (7.1) 

for  all  A  Tb  y  [46,47].  Assuming  Y  has  been  R-orthonormalized,  the  Taylor  series 
expansion  of  fy  about  A  =  0  yields  the  following  model,  which  is  used  by  RTR 

^RTR  (Y^AY^  —  trace  (2Y^AA'j  H — trace  (2A^  (A A  —  BAY^AY^^ 

(7.2) 
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TraceMin  approximates  the  Hessian  of  the  matrix  as  2A,  giving  us 

m™  (A)  =  trace  (Y^AY)  —  trace  (2Y'^AA)  +  ^trace  (2A'^HA)  (7.3) 

RTR’s  model  is  more  accurate  and  provides  a  better  (superlinear)  convergence  rate, 
but  the  individual  TraceMin  iterations  are  cheaper. 

In  the  absence  of  shifts,  TraceMin  tends  to  reduce  the  trace  very  quickly  in  its 
hrst  few  iterations  before  the  trace  levels  off;  RTR  does  the  opposite,  reducing  the 
trace  very  slowly  over  the  hrst  few  iterations  due  to  the  trust  region  constraint.  Our 
tests  will  show  comparisons  with  Chris  Baker’s  RTR  implementation  in  Trilinos  [46] . 

7.5  FEAST 

FEAST  is  Eric  Polizzi’s  eigensolver  package,  which  was  recently  adopted  into  the 
Intel  Math  Kernel  Library.  This  eigensolver  works  by  performing  a  contour  integra¬ 
tion  at  each  iteration  [37,38].  As  a  result,  FEAST  treats  all  matrices  as  complex 
and  requires  considerably  more  storage  than  TraceMin.  It  must  solve  many  linear 
systems  {ZjB  —  A)  Qj  =  V  at  each  iteration,  one  for  each  contour  point.  Since  all 
input  matrices  are  treated  as  complex,  it  is  difficult  to  use  iterative  methods  to  solve 
the  systems.  In  our  comparisons,  we  will  use  FEAST  v  2.1  with  its  default  linear 
solver,  PARDISO  [31,32]. 

Another  result  of  the  contour  integration  is  that  FEAST  requires  a  lot  of  infor¬ 
mation  from  the  user.  It  needs  to  know  both  the  interval  containing  all  eigenvalues 
of  interest,  as  well  as  an  accurate  estimate  of  the  number  of  eigenvalues  within  that 
interval.  Although  FEAST  is  one  of  the  few  eigensolver  packages  currently  capable 
of  multisectioning,  it  does  require  the  user  to  explicitly  provide  the  subintervals;  it 
does  not  determine  them  on  its  own  like  TraceMIN  does. 
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Algorithm  9  FEAST 

Require:  A,B^  A  symmetric  and  B  symmetric  positive  definite 

subspace  dimension  s 
number  of  Gaussian  quadrature  points  Ne 
Gauss  nodes  Ug,  1  <  j  <  weights  cuj,  I  <  j  < 
desired  interval  [Amin,  Amax] 

V  e 

1:  for  k  =  1  maxit  do 

2:  Set  g  =  0,  g  G 

3:  Set  (J  (Amax  Amin)  /2 

4:  for  J  =  1  — )■  Ae  do 

5:  Gompute  =  —  (7r/2)  (nj  —  1) 

6:  Gompute  Zj  =  (Amax  +  Amin)  /2  +  CTC*®-’ 

7:  Solve  (ZjB  -  A)  Qj  =  V 

8:  Gompute  Q  =  Q  —  {ojj/2)  3? 

9:  end  for 

10:  Form  Aq  =  Q^AQ  and  Bq  =  Q^BQ 

11:  Solve  the  eigenvalue  problem  AqX  =  BqXT, 

12:  Gompute  the  Ritz  vectors  Y  =  QX 

13:  Gheck  convergence 

14:  V  =  BY 

15:  end  for 


8  NUMERICAL  EXPERIMENTS 


8.1  Target  hardware 

The  TraceMIN  algorithm  can  be  implemented  on  any  parallel  compnting  plat¬ 
form.  This  software  implementation  is  aimed  at  the  following  broad  class  of  parallel 
architectnres.  I  assume  a  distributed  memory  system  consisting  of  a  large  number  of 
compute  nodes  that  are  interconnected  via  a  high  performance  network,  where  each 
node  consists  of  several  cores.  My  results  were  obtained  on  the  following  architectures: 

•  a  Linux  cluster  with  multicore  nodes.  Each  node  has  two  12-core  Intel  Xeon 
E5-2697  v2  processors  running  at  2.7  GHz,  with  64  GB  of  memory  per  node. 
These  nodes  are  also  interconnected  via  a  fast  Infiniband  switch.  I  will  refer  to 
this  architecture  as  endeavor-1.  All  programs  were  run  with  either  12  threads 
or  12  MPI  processes  per  node  on  this  architecture. 

•  a  Linux  cluster  with  multicore  nodes.  Each  node  has  two  14-core  Intel  processors 
running  at  2.6  GHz,  with  64  GB  of  memory  per  node.  These  nodes  are  also 
interconnected  via  a  fast  Inhniband  switch.  I  will  refer  to  this  architecture  as 
endeavor-2.  All  programs  were  run  with  either  14  threads  or  14  MPI  processes 
per  node  on  this  architecture. 

Since  the  Trilinos  team  is  still  working  on  improving  the  performance  of  their  code 
with  OpenMP,  I  ran  all  Trilinos  code  (including  my  Trilinos-based  implementations 
of  TraceMin  and  TraceMin-Davidson)  with  multiple  processes  per  node.  Similarly,  I 
ran  the  SLEPc  tests  with  pure  MPI.  Both  FEAST  and  my  Fortran  implementations 
of  TraceMin-Sampling  and  TraceMin-Multisectioning  were  run  with  multiple  threads 
per  node. 
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8.2  Computing  a  small  number  of  eigenpairs 

In  each  case,  we  consider  the  desired  eigenpairs  to  be  converged  if  the  relative 
residual  satishes  the  following  criteria 

^  <  10-5 

where  ai  is  the  i-th  Ritz  value. 

The  experiments  are  conducted  both  with  and  without  preconditioning  on  endeavor- 
2.  The  preconditioner  M  is  chosen  such  that 

M-‘  =  ((/  -  A-„'Af  +  (/  -  A^^A)  +  /) 

where  Aq  is  the  diagonal  matrix  obtained  via  SPAI(O)  [48];  we  have  essentially  per¬ 
formed  a  small  number  of  Richardson  iterations. 

Unless  otherwise  stated,  each  eigensolver  used  its  default  parameter  values.  Trace- 
Min  used  a  block  size  of  s  =  2p,  where  p  is  the  number  of  desired  eigenpairs. 
TraceMin-Davidson  used  a  block  size  of  s  =  p  and  stores  a  maximum  of  10  blocks 
in  the  subspace  V.  Upon  restart,  TraceMin-Davidson  retains  the  2p  Ritz  vectors 
corresponding  to  the  smallest  Ritz  values  and  discards  the  rest.  Both  TraceMin  and 
TraceMin-Davidson  use  projected-MINRES  to  solve  the  saddle  point  problem  at  each 
iteration  if  we  do  not  use  preconditioning.  If  we  choose  to  take  advantage  of  a  pre¬ 
conditioner,  we  use  block-diagonal  preconditioned  MINRES  to  solve  the  saddle  point 
problems.  Trilinos’  block  Krylov-Schur  used  a  block  size  of  s  =  p  and  a  maximum 
subspace  dimension  of  lOp.  For  SLEPc’s  Krylov-Schur  with  shift-and-invert,  I  chose 
to  use  MINRES  as  the  inner  linear  solver  with  a  tolerance  of  lO”®. 

The  results  are  summarized  in  tables  8.1,  8.2,  and  8.3. 
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Table  8.1:  Robustness  of  various  solvers  on  our  test  problems. 

yes  denotes  that  a  solver  succeeded  on  this  problem  on  all  numbers  of  MPI  processes, 
and  no  means  the  solver  failed  for  some  reason 


Anasazi 

SLEPc 

Matrix 

TD 

BKS 

LOBPCG 

RTR 

KS 

LOBPCG 

JD 

Poisson 

yes 

yes 

no 

yes 

no 

no 

no 

Flan_1565 

yes 

no 

no 

yes 

no 

no 

yes 

Hook_1498 

yes 

no 

no 

yes 

no 

no 

yes 

cage 15 

yes 

yes 

yes 

yes 

yes 

yes 

yes 

nlpkkt240 

yes 

no 

no 

no 

no 

no 

yes 

Table  8.2:  Running  time  ratios  of  various  solvers  on  our  test  problems  (without 
preconditioning) 


Anasazi 

SLEPc 

Matrix 

TD 

BKS 

LOBPCG 

RTR 

KS 

LOBPCG 

JD 

Poisson 

1.0 

19.7 

3.0 

1.6 

- 

- 

- 

Flan_1565 

1.0 

- 

- 

1.3 

- 

- 

2.1 

Hook_1498 

1.0 

- 

- 

3.3 

- 

- 

1.5 

cage 15 

1.5 

2.5 

1.0 

1.7 

2.2 

5.4 

2.9 

nlpkkt240 

1.0 

- 

- 

- 

- 

- 

5.6 
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Table  8.3:  Running  time  ratios  of  various  solvers  on  our  test  problems  (with  precon¬ 
ditioning) 


Matrix 

TD 

LOBPCG 

RTR 

Poisson 

1.0 

11.7 

1.2 

Flan_1565 

1.0 

1.2 

1.8 

Hook_1498 

1.0 

4.9 

2.3 

cage 15 

2.3 

1.0 

- 
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Figure  8.1.:  Sparsity  pattern  of  LaplaceSD 


8.2.1  LaplaceSD 

For  this  problem,  A  is  the  3D  discretization  of  the  Laplace  operator  on  a  unit  cube 
of  order  64  million  with  roughly  450m  non  zeros  (Figure  8.1).  This  matrix  is  symmetric 
positive  dehnite  and  diagonally  dominant.  We  seek  the  four  smallest  eigenvalues 

(1.84  X  10-^3.68  X  10-^3.68  x  10-^3.68  x  10"^) 

along  with  their  associate  eigenvectors.  Note  that  one  of  these  eigenvalues  has  a 
multiplicity  of  three. 

In  hgure  8.2,  it  appears  as  though  SLEPc’s  Jacobi-Davidson  is  the  fastest  method; 
it  is  roughly  twice  as  fast  as  TraceMin-Davidson.  However,  since  it  uses  a  block  size 
of  1,  SLEPc’s  Jacobi-Davidson  failed  to  capture  the  correct  multiplicity  of  eigenvalue 
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3.68  X  10“^.  When  I  tried  to  increase  the  block  size  to  4,  Jacobi-Davidson  crashed. 
Again,  Jacobi-Davidson  and  TraceMin-Davidson  are  very  similar,  so  if  we  had  been 
able  to  increase  the  block  size,  Jacobi-Davidson  probably  would  have  had  comparable 
performance  to  my  code.  LOBPCG  converged  on  8,  16,  and  32  nodes,  but  it  crashed 
the  rest  of  the  time.  Because  we  were  not  using  a  preconditioner,  LOBPCG  took 
a  large  number  of  iterations  and  stagnated,  causing  an  orthogonalization  error  that 
resulted  in  termination  of  the  program.  SLEPc’s  Krylov-Schur  implementation  failed 
to  converge  in  a  reasonable  amount  of  time  because  it  took  so  long  to  solve  the  linear 
systems  accurately  (over  13  hours  on  4  nodes  and  2  hours  on  128  nodes).  TraceMin- 
Davidson  was  the  fastest  of  the  methods  which  found  the  correct  eigenpairs,  and  it 
had  a  nearly  optimal  speed  improvement  up  to  128  nodes. 

8.2.2  Flan_1565 

Janna/Flan_1565  is  a  symmetric  positive  dehnite  banded  matrix  in  the  Tim  Davis 
collection  [49],  representing  a  3D  model  of  a  steel  flange  (Figure  8.3).  It  is  order  1.5 
million,  with  approximately  100  million  nonzero  entries.  We  seek  the  four  smallest 
eigenvalues  and  their  associated  eigenvectors. 

Figure  8.4  shows  that  all  methods  scaled  quite  well  up  to  32  nodes,  then  began  to 
level  off  a  bit.  This  is  not  surprising,  given  the  relatively  small  size  of  the  matrix  and 
the  fact  that  we  have  not  assured  any  kind  of  load  balancing.  TraceMin-Davidson  was 
the  fastest  method,  though  the  other  two  related  methods  (Jacobi-Davidson  and  the 
Riemannian  Trust  Region  method)  were  also  able  to  solve  the  problem  in  a  reasonable 
amount  of  time.  Both  the  Trilinos  and  SLEPc  implementations  of  LOBPCG  failed 
to  solve  this  problem,  presumably  because  the  iterations  stagnated  and  resulted  in 
a  loss  of  orthogonality.  Trilinos  block  Krylov-Schur  failed  to  solve  the  problem  in  a 
reasonable  amount  of  time  (over  25  hours  on  2  nodes  and  2  hours  on  128  nodes),  and 
SLEPc’s  Krylov-Schur  with  shift-and-invert  failed  to  solve  the  linear  system  to  the 
required  degree  of  accuracy  and  terminated. 


ratio  of  running  time  to  fastest  running  time  (s) 
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T  raceMin-Davidson 
T  rilinos  RTR 
SLEPc JD 
T  rilinos  BKS 


(a)  Scalability  comparison 


■  T  raceMin-Davidson 

■  T rilinos  RTR 

■  SLEPc  JD 


number  of  nodes 


(b)  Ratio  of  running  times 


Figure  8.2.:  A  comparison  of  several  methods  of  computing  the  four  smallest  eigen- 
pairs  of  LaplaceSD  (without  preconditioning) 
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Figure  8.3.:  Sparsity  pattern  of  Flan_1565 


If  we  try  the  same  test  with  preconditioning,  we  obtain  the  results  illustrated  in 
hgure  8.5.  Preconditioning  took  the  running  time  of  TraceMin-Davidson  from  46s 
down  to  27s  on  128  nodes,  if  we  use  the  block  diagonal  preconditioned  MINRES 
previously  described  to  solve  the  saddle  point  problem  at  each  iteration.  In  fact,  we 
note  that  using  the  block  diagonal  preconditioning  in  this  case  is  over  twice  as  fast 
as  using  projected  MINRES.  With  preconditioning,  the  Trilinos  implementation  of 
LOBPCG  was  able  to  converge  on  2,  4,  and  16  nodes  because  the  preconditioner 
caused  the  iterations  to  stagnate  less  frequently,  but  it  still  crashed  most  of  the  time. 


ratio  of  running  time  to  fastest  running  time  (s) 
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■  T  raceMin-Davidson 
■Trilinos  RTR 
■SLEPCJD 


(a)  Scalability  comparison 


(b)  Ratio  of  running  times 


Figure  8.4.:  A  comparison  of  several  methods  of  computing  the  four  smallest  eigen- 
pairs  of  Janna/Flan_1565  (without  preconditioning) 


ratio  of  running  time  to  fastest  running  time  (s) 
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(a)  Scalability  comparison 


■  TD  (BD) 

■  TD  (PK) 
■ LOBPCG 

■  RTR 


number  of  nodes 


(b)  Ratio  of  running  times 

Figure  8.5.:  A  comparison  of  several  methods  of  computing  the  four  smallest  eigen- 
pairs  of  Janna/Flan_1565  (with  preconditioning) 
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8.2.3  Hook_1498 

Janna/Hook_1498  is  a  symmetric  positive  definite  banded  matrix  in  the  Tim  Davis 
collection,  representing  a  3D  model  of  a  steel  hook  (Figure  8.6).  It  is  order  1.5  million, 
with  approximately  60  million  nonzero  entries.  We  seek  the  four  smallest  eigenvalues 
and  their  associated  eigenvectors. 

Figure  8.7  shows  that  Jacobi-Davidson  was  the  fastest  method  up  to  16  nodes 
(although  TraceMin-Davidson  was  still  competitive),  but  on  larger  numbers  of  nodes, 
Jacobi-Davidson  fails  to  scale  welT.  This  is  likely  due  to  the  fact  that  SLEPc  is  inca¬ 
pable  of  using  block  or  pseudo-block  linear  solvers.  Because  my  TraceMin-Davidson 
implementation  uses  pseudo-block  solvers,  it  continued  to  scale  up  to  64  nodes.  Once 
again,  both  implementations  of  LOBPCG  crashed  due  to  orthogonalization  errors, 
and  Krylov-Schur  could  not  solve  the  linear  systems  to  a  sufficient  degree  of  precision 
in  the  shift-and-invert  mode. 

If  we  try  the  same  test  with  preconditioning,  we  obtain  the  results  of  figure  8.8. 
Preconditioning  did  not  greatly  impact  the  running  time  of  TraceMin-Davidson,  but 
it  did  cause  the  Riemannian  Trust  Region  method  to  converge  a  bit  faster.  It  also 
prevented  LOBPCG  from  stagnating. 

8.2.4  cagel5 

For  this  example,  we  will  be  computing  the  Fiedler  vector  of  a  directed  weighted 

graph  with  one  strongly  connected  component  vanHeukelum/cagel5  from  the  Tim 

Davis  collection;  I  will  refer  to  this  graph  as  G.  G  has  approximately  five  million 

rows  and  one  hundred  million  nonzeros.  Recall  that  the  weighted  graph  Laplacian  A 

will  be  symmetric  positive  semi-definite  with  a  null  space  of  dimension  1.  The  null 

vector  is  the  scaled  vector  of  all  Is,  which  we  provided  to  all  eigensolvers.  We  wish 

^BKS  was  over  30  times  slower  than  TraceMin-Davidson.  It  is  not  competitive  for  Hook_1498  and 
has  been  excluded  from  figure  8.7b. 

^In  general,  we  would  determine  the  strongly  connected  components  using  a  Dulmage-Mendelsohn 
permutation  and  treat  each  one  as  a  separate  problem. 
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Figure  8.6.:  Sparsity  pattern  of  Hook_1498 
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Figure  8.7.:  A  comparison  of  several  methods  of  computing  the  four  smallest  eigen- 
pairs  of  Janna/Hook_1498  (without  preconditioning) 
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Figure  8.8.:  A  comparison  of  several  methods  of  computing  the  four  smallest  eigen- 
pairs  of  Janna/Hook_1498  (with  preconditioning) 
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Figure  8.9.:  Sparsity  pattern  of  cagelS 


to  find  the  smallest  nonzero  eigenvalue  (the  graph  connectivity)  and  the  associated 
eigenvector,  which  is  referred  to  as  the  Fiedler  vector.  For  this  problem,  I  increased 
TraceMin’s  block  size  to  s  =  6  and  set  the  maximum  number  of  vectors  to  be  stored 
in  V  to  20  for  both  TraceMin-Davidson  and  block  Krylov-Schur. 

Figure  8.10  shows  a  comparison  between  many  different  eigensolvers  on  this  prob¬ 
lem.  We  see  that  all  methods  performed  very  well  because  it  was  easy  to  solve  linear 
systems  involving  A.  The  Trilinos  implementation  of  LOBPCG  was  the  fastest,  but 
TraceMin-Davidson  still  performed  quite  well.  The  issue  here  was  that  the  eigen¬ 
values  are  clustered,  and  TraceMin-Davidson  should  have  used  a  much  weaker  inner 
tolerance  than  it  did,  since  the  outer  convergence  rate  was  going  to  be  poor  regardless. 
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Figure  8.11  shows  that  we  still  did  very  well  on  this  problem,  in  comparison  with  the 
other  eigensolvers. 

8.2.5  nlpkkt240 

Schenk/nlpkkt240  is  a  symmetric  indehnite  KKT  matrix  in  the  Tim  Davis  col¬ 
lection  of  order  27  million  with  approximately  800  million  nonzeroes.  I  reordered  it 
to  a  banded  matrix  (hgure  8.12),  using  symmetric  reverse  Cuthill-McKee,  so  that 
the  matrix  vector  multiplications  would  be  more  efficient^.  We  seek  the  four  smalest 
magnitude  eigenvalues  with  their  associated  eigenvectors;  note  that  these  will  be  in¬ 
terior  eigenpairs  rather  than  extreme  ones.  The  Riemannian  Trust  Region  method 
and  LOBPCG  cannot  compute  interior  eigenpairs  without  a  spectral  transformation 
such  as  spectrum  folding. 

Figure  8.13  shows  the  running  time  of  both  TraceMin-Davidson  and  SLEPc’s 
Jacob i-Davidson  implementation,  both  of  which  performed  a  harmonic  Ritz  extrac¬ 
tion.  Trilinos  BKS  took  a  prohibitively  long  time  to  converge  (over  3  hours  on  128 
nodes),  and  SLEPc’s  Krylov-Schur  with  shift-and-invert  failed  to  converge  because 
the  Krylov  solver  was  unable  to  solve  the  linear  systems  to  a  sufficient  degree  of  pre¬ 
cision.  LOBPCG  (with  specturm  folding)  also  took  too  much  time  to  be  competitve. 
TraceMin-Davidson  was  over  six  times  faster  than  Jacobi-Davidson,  presumably  be¬ 
cause  Jacobi-Davidson  used  more  aggressive  shifts  which  approximated  eigenvalues 
that  were  much  larger  than  the  ones  we  desired.  We  also  see  that  TraceMin-Davidson 
scaled  almost  perfectly,  whereas  Jacobi-Davidson  did  not  scale  as  well  on  a  large 
number  of  nodes.  This  is  presumably  due  to  the  fact  that  TraceMin-Davidson  used 
a  pseudo-block  solver,  and  SLEPc’s  Jacobi-Davidson  did  not. 

^Technically,  since  both  Trilinos  and  SLEPc  allow  the  user  to  provide  an  operator  rather  than 
a  matrix,  we  could  have  provided  matvecs  and  linear  solvers  that  took  advantage  of  the  special 
structure  of  this  matrix. 
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(a)  Comparison  between  TraceMin-Davidson  and  other  Trilinos  eigensolvers 
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(b)  Comparison  between  TraceMin-Davidson  and  several  SLEPc  eigensolvers 


Figure  8.10.:  A  comparison  of  several  methods  of  computing  the  Fiedler  vector  for 
cage 15 
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Figure  8.11.:  Ratio  of  running  times  for  computing  the  Fiedler  vector  of  cagel5 
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Figure  8.12.:  Sparsity  pattern  of  nlpkkt240  (after  RCM  reordering) 
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Figure  8.13.:  A  comparison  of  several  methods  of  computing  the  four  smallest  eigen¬ 
values  of  nlpkkt240 
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Table  8.4:  Running  time  for  TraceMin-Sampling  on  the  Anderson  problem 


tolerance 

running  time  (s) 

10“^ 

283 

10-6 

306 

10-9 

392 

8.3  Sampling 

The  results  in  this  section  were  obtained  using  a  Fortran  90  implementation  of 
TraceMin,  using  PARDISO  to  solve  the  linear  systems  arising  at  each  iteration.  We 
have  converged  when  the  relative  residual 

||rj||2 

max(||A||^,i|R||J 

These  tests  were  run  on  endeavor-1. 

8.3.1  Anderson  model  of  localization 

For  this  example,  we  will  compute  a  few  eigenpairs  of  an  Anderson  matrix  of  order 
one  million  closest  to  a  set  of  shifts  (pictured  in  hgure  8.14).  We  will  compute  the 
four  eigenvalues  closest  to  100  evenly  spaced  shifts  in  the  interval  [—1, 1]  using  100 
MPI  processes.  Table  8.4  shows  that  it  only  took  us  hve  minutes  to  compute  the  400 
desired  interior  eigenpairs. 

8.3.2  Nastran  benchmark  (order  1.5  million) 

We  will  compute  a  few  eigenpairs  of  the  Nastran  benchmark  of  order  1.5  million 
in  this  example.  This  is  a  generalized  eigenvalue  problem;  the  sparsity  patterns  of 
A  and  B  are  plotted  in  hgure  8.15.  We  will  compute  the  4  eigenvalues  closest  to 
100  evenly  spaced  shifts  in  the  interval  [—0.01;  1,461,000]  using  100  MPI  processes. 
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Figure  8.14.:  Sparsity  pattern  of  the  Anderson  matrix 
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Table  8.5:  Running  time  for  TraceMin-Sampling  on  the  Nastran  benchmark  of  order 
1.5  million 


tolerance 

running  time  (s) 

10-® 

21 

10-6 

25 

10-9 

40 

Table  8.6:  Running  time  for  TraceMin-Sampling  on  the  Nastran  benchmark  of  order 
7.2  million 


tolerance 

rnnning  time  (s) 

10-6 

181 

10-6 

200 

10-9 

302 

Table  8.5  shows  that  it  only  took  about  30  seconds  to  compute  the  400  desired  interior 
eigenpairs. 

8.3.3  Nastran  benchmark  (order  7.2  million) 

We  will  compute  a  few  eigenpairs  of  the  Nastran  benchmark  of  order  7.2  million 
in  this  example.  This  is  a  generalized  eigenvalue  problem;  the  sparsity  patterns  of 
A  and  B  are  plotted  in  hgure  8.16.  We  will  compnte  the  4  eigenvalnes  closest  to 
100  evenly  spaced  shifts  in  the  interval  [—0.01;  2,  785,  937.5]  using  100  MPI  processes. 
Table  8.6  shows  that  it  only  took  a  few  minutes  to  compute  the  400  desired  interior 
eigenpairs. 


0 


(a)  Sparsity  pattern  of  stiffness  matrix  A 


(b)  Sparsity  pattern  of  mass  matrix  B 

Figure  8.15.:  Sparsity  patterns  for  the  Nastran  benchmark  of  order  1.5  million 
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(a)  Sparsity  pattern  of  stiffness  matrix  A 


nz  =  185790017  6 
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(b)  Sparsity  pattern  of  mass  matrix  B 


Figure  8.16.:  Sparsity  patterns  for  the  Nastran  benchmark  of  order  7.2  million 
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8.4  TraceMin-Multisectioning 

In  these  examples,  we  seek  all  eigenpairs  located  in  a  large  interval,  with  a  relative 
residual 

- -  ^q-6 

max(||kl||,,||i?||J 

on  the  platform  endeavor-2.  This  interval  can  contain  thousands  of  eigenpairs,  so  I  will 
be  running  the  multisectioning  code  previously  described.  TraceMin-Multisectioning 
will  subdivide  the  global  interval  until  it  has  many  smaller  subintervals,  each  contain¬ 
ing  at  most  20  eigenvalues.  I  will  use  1  MPI  process  per  node,  with  14  threads  per 
process,  using  PARDISO  to  compute  the  inertia  and  solve  the  linear  systems. 

I  compared  my  code  against  both  FEAST  and  SLEPc’s  Krylov-Schur  with  spec¬ 
trum  slicing.  Since  FEAST  requires  users  to  explicitly  subdivide  the  interval  them¬ 
selves,  I  divided  it  into  p  equally  sized  pieces,  where  p  is  the  number  of  MPI  processes, 
and  provided  each  MPI  process  with  a  single  one  of  those  pieces.  I  ran  FEAST  with 
14  threads  per  MPI  process,  with  1  MPI  process  per  node.  SLEPc’s  Krylov-Schur 
spectrum  slicing  implementation  does  dynamically  subdivide  the  interval,  but  the 
subintervals  are  dependent.  As  a  result,  they  must  be  processed  one  at  a  time,  using 
all  MPI  processes  on  a  single  group,  which  can  result  in  poor  scalability.  No  timing 
results  are  presented  for  SLEPc,  since  all  tests  failed  with  the  error  message:  “PETSC 
ERROR:  Unexpected  error  in  Spectrum  Slicing!  Mismatch  between  number  of  values 
found  and  information  from  inertia.”  The  results  are  summarized  in  table  8.7. 

8.4.1  Nastran  benchmark  (order  1.5  million) 

We  seek  all  the  eigenpairs  in  the  region  [-0.01,1.461e6],  which  contains  1000  eigen¬ 
pairs  (hgure  8.17).  TraceMin  and  FEAST  both  performed  reasonably  well  on  this 
problem  up  to  33  nodes  (as  shown  in  hgure  8.18),  then  failed  to  continue  scaling. 
However,  TraceMin-Multisectioning  still  managed  to  compute  all  1000  eigenpairs  in 
roughly  one  minute  on  as  few  as  33  nodes,  and  it  was  over  twice  as  fast  as  FEAST. 
We  also  see  that  the  subdivision  of  intervals,  which  could  be  considered  to  be  a  pre- 
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Table  8.7:  Running  time  comparison  of  FEAST  and  TraceMin-Multisectioning 


Matrix 

Interval 

nev 

TraceMin 

FEAST 

Speedup 

Nastran  1.5m 

[-0.01,1.461e6] 

1000 

59  s 

121  s 

2.2 

Nastran  7.2m 

[-0.01,2785937.5] 

1000 

418  s 

- 

- 

Anderson 

[-0.01,0.01] 

1143 

792  s 

7910  s 

10.0 

af^helllO 

[2000,2250] 

1045 

37  s 

274  s 

7.3 

dielFilterVSreal 

[25,50] 

2969 

301  s 

912  s 

3.0 

StocF-1465 

[580,600] 

4150 

195  s 

738  s 

3.8 
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Figure  8.17.:  Histogram  of  the  eigenvalues  of  interest  for  the  Nastran  benchmark 
(order  1.5  million) 


processing  step,  took  less  than  10%  of  the  total  running  time.  FEAST  crashed  on 
both  2  and  3  nodes  because  it  could  not  store  so  many  eigenvectors  on  a  single  node 
(along  with  the  matrices  A  and  B,  and  also  the  complex  factorization). 

8.4.2  Nastran  benchmark  (order  7.2  million) 

We  seek  all  the  eigenpairs  in  the  region  [-0.01,2785937.5],  which  contains  1000 
eigenpairs  (hgure  8.20).  FEAST  failed  to  run  on  any  number  of  nodes  in  this  case, 
presumably  because  the  complex  factorization  of  such  a  large  matrix  took  up  too 
much  memory.  TraceMin  performed  reasonably  well  on  this  problem  up  to  33  nodes 
again  (as  shown  in  figure  8.18),  then  failed  to  continue  scaling. 


ratio  of  running  time  to  fastest  running  time  (s) 
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■■ - TraceMin  ♦  FEAST 

perfect  scaling  (TM)  —  —  perfect  scaling  (FEAST) 

(a)  Scalability  comparison 


■  TraceMin 

■  FEAST 


number  of  nodes 


(b)  Ratio  of  running  times 

Figure  8.18.:  A  comparison  of  several  methods  of  computing  a  large  number  of  eigen¬ 
values  of  the  Nastran  benchmark  (order  1.5  million) 
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(a)  Amount  of  time  spent  in  each  stage  (s) 


■  idle  time 
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(b)  Percent  of  time  spent  in  each  stage 


Figure  8.19.:  Running  time  breakdown  for  TraceMin-Multisectioning  on  the  Nastran 
benchmark  (order  1.5  million) 
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Figure  8.20.:  Histogram  of  the  eigenvalues  of  interest  for  the  Nastran  benchmark 
(order  7.2  million) 


1 00000 


10 

1 

2  3  5  9  17  33  65  129 

number  of  nodes 

- ■ - TraceMin  —  —  perfect  scaling  (TM) 


Figure  8.21.:  A  comparison  of  several  methods  of  computing  a  large  number  of  eigen¬ 
values  of  the  Nastran  benchmark  (order  7.2  million) 
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(a)  Amount  of  time  spent  in  each  stage  (s) 
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(b)  Percent  of  time  spent  in  each  stage 


Figure  8.22.:  Running  time  breakdown  for  TraceMin-Multisectioning  on  the  Nastran 
benchmark  (order  7.2  million) 
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Figure  8.23.:  Histogram  of  the  eigenvalues  of  interest  for  the  Anderson  model 


8.4.3  Anderson  model  of  localization 

We  seek  all  the  eigenpairs  in  the  region  [-0.01,0.01],  which  contains  1143  eigen- 
pairs  (hgure  8.23).  FEAST  failed  to  run  on  a  small  number  of  nodes,  again  due  to 
memory  issues.  It  failed  to  scale  at  all  from  5  to  129  nodes  since  the  vast  majority 
of  the  running  time  is  spent  on  the  factorization  stage  rather  than  solving  linear  sys¬ 
tems;  no  matter  how  many  subintervals  we  use,  we  still  require  the  same  number  of 
factorizations.  TraceMin-Multisectioning  spent  most  of  its  time  in  the  preprocessing 
stage,  since  this  matrix  is  so  difficult  to  factor‘d.  TraceMin-Multisectioning  was  still 
10  times  faster  than  FEAST  on  this  problem. 

^Recall  that  our  matrix  A  has  the  same  sparsity  pattern  as  the  3D  discretization  of  the  Laplace 
operator  with  periodic  boundary  conditions. 
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Figure  8.24.:  A  comparison  of  several  methods  of  computing  a  large  number  of  eigen¬ 
values  of  the  Anderson  model 
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Figure  8.25.:  Running  time  breakdown  for  TraceMin-Multisectioning  on  the  Anderson 
model 
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Figure  8.26.:  Sparsity  pattern  of  aLshelllO 


8.4.4  aLshelllO 

The  matrix  Schenk_AFE/af_shelllO  (hgure  8.26)  arises  from  an  AutoForm  En¬ 
gineering  sheet  metal  forming  simulation.  We  will  compute  all  the  eigenpairs  in 
the  interval  [2000,2250],  which  contains  1045  eigenvalues  (hgure  8.27).  In  this  case, 
TraceMin  multisectioning  scaled  reasonably  well  up  to  129  nodes  (hgure  8.28),  while 
FEAST  did  not.  Figure  8.29  shows  that  we  scaled  well  because  factorizations  were 
cheap,  and  our  processes  did  not  spend  a  large  amount  of  time  idle. 
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Figure  8.27.:  Histogram  of  the  eigenvalues  of  interest  for  aLshelllO 
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Figure  8.28.:  A  comparison  of  several  methods  of  computing  a  large  number  of  eigen¬ 
values  of  af_shelllO 
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Figure  8.29.:  Running  time  breakdown  for  TraceMin-Multisectioning  on  af^helllO 
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8.4.5  dielFilterVSreal 

Dziekonski/dielFilterVSreal  is  an  electromagnetics  matrix  of  order  1.1  million  pic¬ 
tured  in  figure  8.30.  We  will  compute  all  eigenpairs  in  the  interval  [25,50],  which 
contains  2969  eigenpairs  (hgure  8.31).  Note  that  these  eigenvalues  are  not  evenly 
distributed  throughout  the  interval.  FEAST  fails  on  2,  3,  5,  9,  and  17  nodes  because 
it  ran  out  of  space  (hgure  8.32);  the  intervals  closest  to  26  are  very  dense  and  require 
a  great  deal  of  storage.  TraceMin  did  not  run  into  storage  issues  because  of  its  dy¬ 
namic  multisectioning  strategy.  Figure  8.33  shows  that  both  the  time  spent  factoring 
the  matrix  and  the  time  spent  running  the  eigensolver  scaled  as  we  would  expect, 
but  it  did  spent  a  large  amount  of  time  idle  on  large  numbers  of  nodes,  resulting 
in  suboptimal  scalabiltiy  on  129  nodes.  TraceMin-Multisectioning  was  still  3  times 
faster  than  FEAST. 

8.4.6  StocF-1465 

Janna/StocF-1465  is  a  huid  dynamics  matrix  of  order  1.4  million  pictured  in  hgure 
8.34.  We  will  compute  all  eigenpairs  in  the  interval  [580,600],  which  contains  4150 
eigenvalues  (hgure  8.35).  Unlike  the  previous  example,  these  eigenvalues  are  pretty 
evenly  distributed  in  the  interval.  FEAST  still  failed  on  2,  3,  5,  and  9  nodes,  since  the 
number  of  vectors  it  needed  to  store  was  so  large  (hgure  8.36).  On  17  nodes,  it  has  a 
comparable  running  time  to  TraceMin,  but  it  failed  to  scale  beyond  that  point.  Once 
again,  the  factorization  is  the  most  expensive  part,  so  the  cost  of  running  FEAST 
will  be  roughly  the  same  regardless  of  how  many  eigenvalues  an  interval  contains. 
TraceMin  scaled  well  up  to  129  nodes  because  in  this  case,  there  was  plenty  of  work 
to  go  around,  so  the  MPI  processes  did  not  spend  a  great  deal  of  time  idle  (hgure 
8.37). 


Figure  8.30.:  Sparsity  pattern  of  dielFilterVSreal 
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Figure  8.31.:  Histogram  of  the  eigenvalues  of  interest  for  dielFilterVSreal 
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Figure  8.32.:  A  comparison  of  several  methods  of  computing  a  large  number  of  eigen¬ 


values  of  dielFilterV3real 


131 


number  of  MPI  processes 


■time  spent  factoring 
■time  spent  running 
eigensolver 
'  idle  time 


(a)  Amount  of  time  spent  in  each  stage  (s) 


■  idle  time 

■  time  spent  running 
eigensolver 

■  time  spent  factoring 


number  of  MPI  processes 


(b)  Percent  of  time  spent  in  each  stage 


Figure  8.33.:  Running  time  breakdown  for  TraceMin-Multisectioning  on  dielFil- 
terV3real 


Figure  8.34.:  Sparsity  pattern  of  StocF-1465 
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Figure  8.35.:  Histogram  of  the  eigenvalues  of  interest  for  StocF-1465 
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■■ - TraceMin  ♦  FEAST 

perfect  scaling  (TM)  —  —  perfect  scaling  (FEAST) 

(a)  Scalability  comparison 
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Figure  8.36.:  A  comparison  of  several  methods  of  computing  a  large  number  of  eigen¬ 
values  of  StocF-1465 
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Figure  8.37.:  Running  time  breakdown  for  TraceMin-Multisectioning  on  StocF-1465 
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9  FUTURE  WORK 

In  this  chapter,  we  will  examine  further  improvements  to  the  methods  previously 
discussed. 

9.1  Improved  selection  of  the  tolerance  for  Krylov  solvers  within  TraceMin 

We  have  seen  that  TraceMin  is  capable  of  converging  even  when  we  use  a  modest 
tolerance  for  the  Krylov  solver  called  at  each  iteration  of  TraceMin.  The  stricter 
the  inner  tolerance,  the  fewer  TraceMin  iterations  are  required;  however,  strict  inner 
tolerances  also  increase  the  number  of  Krylov  iterations  required.  The  optimal  inner 
tolerance  is  based  on  the  ratio  of  the  desired  eigenvalue  to  the  smallest  eigenvalue 
outside  of  our  desired  subspace 

A. 

As+i 

If  this  ratio  is  large  (meaning  the  eigenvalues  are  clustered),  we  should  use  a  lenient 
inner  tolerance;  if  the  ratio  is  closer  to  0  (meaning  the  eigenvalues  are  well  separated), 
we  should  use  a  stricter  inner  tolerance.  In  practice,  we  do  not  know  this  ratio,  so  it 
is  difficult  to  choose  an  effective  inner  tolerance.  I  presented  a  method  of  selecting  the 
inner  tolerance  based  on  both  the  Ritz  values  and  the  current  iteration  number  which 
tends  to  work  well.  However,  in  one  of  the  test  cases  presented  in  the  Results  section, 
LOBPCG  was  faster  than  TraceMin-Davidson,  since  we  solved  the  linear  systems 
much  more  accurately  than  what  would  have  been  optimal.  We  must  devote  more 
attention  to  the  selection  of  this  inner  tolerance  to  avoid  such  situations.  Perhaps 
we  could  select  the  inner  tolerance  based  on  the  trace  reduction  at  each  iteration  of 


TraceMin. 
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9.2  Combining  the  strengths  of  TraceMin  and  the  Riemannian  Trust  Region  method 

TraceMin  uses  an  inexact  Hessian  of  the  generalized  Rayleigh  quotient  (equation 
9.1),  which  causes  its  linear  convergence. 

HTraceMin  2  A  (9.1) 

The  Riemannian  Trust  Region  method  (RTR)  uses  an  exact  Hessian  (equation  9.2) 
in  order  to  obtain  superlinear  convergence. 

ffRTR  :  S  ^  2  (^AS  -  BS  (Y^BY)  Y^AY^  (9.2) 

These  two  methods  have  complementary  qualities.  TraceMIN  starts  off  with  a  sharp 
decrese  to  the  trace,  but  the  trace  stagnates  after  a  certain  number  of  iterations 
(unless  we  use  an  acceleration  method  such  as  dynamic  Ritz  shifts).  The  first  few 
steps  of  RTR  have  a  difficult  time  exploiting  a  good  preconditioner,  since  efficient 
preconditioned  steps  are  likely  to  be  rejected  for  falling  outside  the  trust  region  [47]. 
Naturally,  it  would  be  benehcial  to  combine  the  strengths  of  these  two  methods. 
In  2004,  Absil,  Baker,  and  Gallivan  found  that  using  a  few  iterations  of  TraceMin 
to  generate  an  initial  subspace  for  RTR  resulted  in  better  convergence  than  either 
method  on  its  own  [47].  However,  they  only  looked  at  one  very  small  eigenvalue 
problem  (that  of  the  Calgary  Olympic  Saddledome  arena),  so  this  must  be  examined 
further.  They  also  neglected  to  specify  a  heuristic  method  of  determining  when  to 
switch  from  TraceMin  to  RTR  iterations.  Now  that  both  TraceMin  and  RTR  are 
implemented  in  Trilinos  sharing  a  common  interface,  we  should  be  able  to  test  the 
combined  TraceMin- RTR  eigensolver  and  determine  whether  RTR  is  a  more  effective 
acceleration  method  than  the  dynamic  Ritz  shifting. 

9.3  Minimizing  the  idle  time  in  TraceMin-Multisectioning 

My  current  Fortran90-based  TraceMin-Multisectioning  implementation  has  a  small 
flaw  that  results  in  extra  idle  time  for  some  MPI  processes,  limiting  its  parallel  seal- 
ability.  MPI  process  0  holds  a  list  of  work  that  is  not  currently  being  processed  by 
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any  other  node;  this  work  (a  set  of  subintervals)  is  held  in  a  stack.  When  another 
MPI  process  requests  a  subinterval  to  either  subdivide  or  run  TraceMin  on,  process  0 
will  give  it  the  top  item  of  the  stack.  It  will  disregard  any  cost  associated  with  that 
subinterval.  As  I  mentioned,  it  is  difficult  to  estimate  the  amount  of  work  running 
TraceMin  on  an  interval  will  require.  However,  it  is  reasonable  to  assume  subdividing 
an  interval  containing  900  eigenvalues  will  be  more  expensive  than  subdividing  an 
interval  containing  50  eigenvalues.  It  is  also  important  to  process  the  interval  con¬ 
taining  900  eigenvalues  first,  since  it  will  create  many  new  subintervals  which  can  be 
treated  as  separate  jobs.  Rather  than  storing  the  additional  jobs  in  a  stack,  we  should 
be  using  a  priority  queue  to  ensure  that  large  subintervals  get  processed  hrst. 

9.4  Removing  TraceMin-Multisectioning’s  dependence  on  a  direct  solver 

In  some  cases,  we  cannot  factor  the  matrix  A  —  aB  to  compute  the  inertia.  Either 
there  is  too  much  hll-in,  or  perhaps  A  and  B  were  not  made  explicitly  available. 
Instead  of  computing  the  exact  inertia  via  an  expensive  factorization,  we  can  use  an 
estimate  of  the  eigenvalue  count  in  an  interval  [50]  obtained  via  relatively  cheap  matrix 
vector  multiplications  and  inner  products  The  only  thing  this  would  change  about 
TraceMin  is  that  it  would  require  resilience  to  incorrect  estimates  of  eigenvalue  counts 
in  the  intervals.  I  will  now  discuss  how  we  could  handle  these  incorrect  estimates. 

If  the  estimate  was  too  large,  i.e.  we  expected  20  eigenvalues  in  the  interval  (a,  b) 
and  there  were  really  only  10  in  that  interval,  TraceMin  can  recover  easily.  Since 
TraceMin  converges  to  the  eigenvalues  closest  to  our  shift  hrst,  we  can  terminate  the 
iterations  as  soon  as  we  converge  to  an  eigenvalue  outside  of  the  interval  (a,  b). 

If  the  estimate  was  too  small,  i.e.  we  expected  20  eigenvalues  in  the  interval 

(a,  b)  when  there  were  really  100,  TraceMin  is  still  capable  of  recovering.  If  TraceMin 

converged  to  20  eigenvalues  and  none  of  them  exist  outside  the  interval  (a,  6),  we 

^In  general,  inner  products  are  relatively  expensive  operations  on  large  parallel  architectures.  Recall 
that  in  this  multisectioning  scheme,  these  inner  products  would  only  take  place  inside  a  single  group 
of  MPI  processes;  we  would  not  be  performing  global  reductions  with  all  MPI  processes. 
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could  assume  that  there  were  more  in  that  interval  and  continue  working  as  follows 
Let  (c,  d)  be  the  smallest  interval  containing  the  20  eigenvalues  we  found.  We  may 
now  treat  (a,  c)  and  {d,  b)  as  new  intervals  of  interest  and  run  a  separate  instance 
of  TraceMin  on  each  of  them.  To  prevent  TraceMin  from  getting  confused  near  the 
edge  of  those  intervals,  we  could  project  out  the  eigenvectors  corresponding  to  the 
eigenvalues  we  found  closest  to  the  boundary  of  (a,  b). 


^If  we  got  lucky  and  the  interval  legitimately  contained  exactly  20  eigenpairs,  we  could  still  perform 
this  procedure.  We  would  simply  run  TraceMin  on  the  two  intervals  (a,  c)  and  (d,  6)  until  they 
converged  to  one  eigenvalue  outside  of  those  intervals,  then  recognize  that  they  were  empty.  This 
would  create  unnecessary  work,  but  it  would  ensure  that  we  did  not  miss  any  eigenvalues. 
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10  SUMMARY 

The  solution  of  sparse  symmetric  eigenvalue  problems  plays  a  significant  role  in  many 
fields  of  computational  science  and  engineering.  Many  eigensolvers  require  accurate 
solutions  of  linear  systems  in  order  to  compnte  the  smallest  eigenpairs,  which  can 
be  infeasible  when  the  matrices  are  large  and  ill-conditioned.  I  presented  a  family 
of  trace-minimization  eigensolvers  which  converge  even  when  the  linear  systems  are 
solved  iteratively  with  a  modest  tolerance. 

First,  we  reviewed  some  applications  that  give  rise  to  sparse  symmetric  eigenvalne 
problems,  snch  as  antomotive  engineering,  condensed  matter  physics,  and  spectral 
reordering.  Then  I  presented  several  methods  of  solving  saddle  point  problems,  the 
most  important  and  time-consnming  kernel  of  these  trace-minimization  methods.  We 
may  either  use  a  projected-Krylov  method,  compute  the  Schur  complement,  or  use  a 
Krylov  method  with  a  block  diagonal  preconditioner. 

After  exploring  how  to  solve  the  saddle  point  problems  arising  at  each  iteration 
of  the  trace-minimization  eigensolvers,  I  described  two  snch  solvers:  TraceMin  and 
TraceMin-Davidson.  TraceMin  constrncts  its  approximate  eigenvectors  from  a  snb- 
space  of  constant  dimension,  whereas  TraceMin-Davidson  uses  expanding  snbspaces. 
We  discussed  the  impact  of  the  block  size  and  distribntion  of  eigenvalues  on  the  overall 
convergence  rate,  as  well  as  how  dynamic  Ritz  shifts  can  improve  the  rate  of  conver¬ 
gence.  We  also  explored  the  impact  of  the  tolerance  used  for  the  inner  Krylov  method 
and  saw  that  these  eigensolvers  can  converge  even  when  very  inexact  solves  are  used, 
unlike  methods  snch  as  simnltaneous  iteration.  In  addition,  we  stndied  how  harmonic 
Ritz  extraction  can  beneht  TraceMin-Davidson  in  hnding  interior  eigenvalnes. 

Next,  I  explained  how  to  solve  several  types  of  problems  with  TraceMin  and 
TraceMin-Davidson.  These  eigensolvers  are  designed  to  compnte  the  smallest  magni¬ 
tude  eigenvalues  of  a  symmetric  eigenvalue  problem,  but  we  can  cause  them  to  target 
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alternate  eigenpairs  through  the  use  of  spectral  transformations.  I  also  presented 
two  additional  implementations  referred  to  as  TraceMin-Sampling  and  TraceMin- 
Multisectioning.  TraceMin-Sampling  is  designed  to  compute  a  few  eigenpairs  near 
a  large  set  of  shifts,  and  TraceMin-Multisectioning  was  created  to  compute  all  the 
eigenpairs  in  an  interval.  My  multisectioning  implementation  uses  a  method  simi¬ 
lar  to  adaptive  quadrature  to  subdivide  the  large  global  interval  into  many  smaller 
subintervals  which  can  be  processed  independently. 

I  then  described  several  other  popular  eigensolvers  such  as  Krylov-Schur,  LOBPCG, 
Jacobi-Davidson,  the  Riemannian  Trust  Region  method,  and  FEAST.  After  estab¬ 
lishing  how  the  competing  eigensolvers  work,  I  presented  comparisons  between  those 
eigensolvers  and  my  trace-minimization  solvers.  The  results  showed  that  TraceMin 
and  TraceMin-Davidson  are  very  robust,  and  my  implementations  are  quite  compet¬ 
itive  with  the  leading  software  packages  in  terms  of  speed  and  scalabiltiy. 
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